Five ITOps best practices to stay ahead during major third-party outages


by Adam Blau | Nov 19, 2025

When external providers fail (whether it was the CrowdStrike outage last year, the AWS outage last month, or the Cloudflare DNS outage yesterday), the symptoms inside your environment often look like internal issues: timeouts, login failures, API errors, service degradation, or sudden spikes in dependency-related alerts.

It’s natural for teams to start searching through their own infrastructure first, but none of these symptoms clearly point to your systems as the root cause. That ambiguity can delay resolution, confuse end-users, and overwhelm service desks.

However, enterprises using BigPanda AI Detection and Response and AI Incident Assistant receive more than just signals from internal systems; they gain what we call external observability.

External observability provides real-time visibility into outages happening outside your organization, such as DNS disruptions, CDN failures, SaaS incidents, network instability, or widespread internet health events. It then correlates them with the symptoms appearing in your environment.

Sample summary of BigPanda external observability findings for the November 2025 Cloudflare outage

This context doesn’t replace internal monitoring and observability tools; it simply provides teams with the clarity to respond more quickly when the world around them breaks. Most importantly, external observability helps teams recognize the difference between “something is broken here” and “something is broken out there,” which dramatically reduces confusion and panic internally across the business.

Below, we outline five best practices BigPanda customers rely on during third-party outages, and how they apply during external disruptions, such as the November 2025 Cloudflare DNS event.

Best practice 1: Control the flood of outage-related alerts

At the time of the faulty CrowdStrike update, some BigPanda customers had hundreds or thousands of hosts fall offline. Alerts came from impacted systems, interrelated systems, and online systems that absorbed the resulting additional load. Response teams faced a flood of alert noise and the challenge of sifting through it to find relevant, actionable information.

Organizations that had previously deployed robust alert correlation had a distinct advantage. These teams used BigPanda to correlate the flood of “host not reporting” and related service alerts into fewer, clearly articulated incidents. Our customers shared that BigPanda alert filtering and correlation were instrumental in managing the volume, providing critical data, and helping teams prioritize and resolve issues efficiently.
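To make the idea of correlation concrete, here is a minimal sketch in Python. It is not BigPanda's correlation logic; it assumes a simplified rule where alerts sharing the same check name and arriving within a fixed time window of one another are grouped into a single incident. The `Alert` type, the `correlate` function, and the five-minute window are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    check: str       # e.g. "host_not_reporting" (illustrative name)
    timestamp: int   # seconds since epoch


def correlate(alerts, window_seconds=300):
    """Group alerts with the same check into incidents.

    An alert joins the current incident for its check if it arrives
    within window_seconds of the previous alert in that incident;
    otherwise it starts a new incident. This is an assumed rule.
    """
    by_check = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_check[alert.check].append(alert)

    incidents = []
    for items in by_check.values():
        current = [items[0]]
        for alert in items[1:]:
            if alert.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alert)
            else:
                incidents.append(current)
                current = [alert]
        incidents.append(current)
    return incidents


alerts = [
    Alert("web-01", "host_not_reporting", 0),
    Alert("web-02", "host_not_reporting", 60),
    Alert("web-03", "host_not_reporting", 120),
    Alert("db-01", "cpu_high", 30),
    Alert("web-04", "host_not_reporting", 1000),
]
incidents = correlate(alerts)
```

In this toy data set, five alerts collapse into three incidents. Real correlation typically uses richer keys than the check name alone, such as topology or enrichment tags.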

Best practice 2: Rapidly identify incidents tied to impacted hosts

The CrowdStrike outage last year was caused by an update, so response teams faced the challenge of identifying which incidents were related to the update and which were not. Due to the volume of incidents associated with the outage, there was a risk of losing track of other non-CrowdStrike-related incidents amid the chaos.

BigPanda enriches incidents with insightful information, such as auto-generated titles and summaries, and, importantly, the suspected root cause. With this detail, responders could immediately identify which incidents were related and address them appropriately. As a result, non-CrowdStrike-related incidents remained visible and actionable, even when other incidents threatened to overwhelm operations teams.

Best practice 3: Create a list of all impacted hosts or services

During the CrowdStrike incident, some of our customers used BigPanda to provide a consolidated list of hosts showing outages. They achieved this by searching for incidents based on enrichment tags (unified search) and using unified analytics, along with support from some BigPanda employees who jumped into action to assist.

When a third-party provider experiences an outage, internal systems can begin to fail in various ways. BigPanda customers use unified search and analytics to:

Build consolidated lists of affected resources.
Determine true scope.
Track progress on recovery.
Visualize impact in real time with dashboards.

This clarity helps teams focus on mitigation rather than detection.
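As a rough illustration of the list-building and progress-tracking steps above, the following Python sketch filters incidents by an enrichment tag and rolls them up into one view of affected hosts. The incident shape, the `suspected_cause` tag, and both function names are assumptions made for this example, not BigPanda's data model or API.

```python
def impacted_hosts(incidents, tag_key, tag_value):
    """Return {host: recovered} for hosts in incidents matching the tag.

    A host counts as recovered only if every matching incident it
    appears in has status "resolved".
    """
    hosts = {}
    for incident in incidents:
        if incident["tags"].get(tag_key) != tag_value:
            continue
        resolved = incident["status"] == "resolved"
        for host in incident["hosts"]:
            hosts[host] = hosts.get(host, True) and resolved
    return hosts


def recovery_progress(hosts):
    """Fraction of impacted hosts that have recovered (0.0 if none)."""
    if not hosts:
        return 0.0
    return sum(hosts.values()) / len(hosts)


# Hypothetical incidents with enrichment tags.
incidents = [
    {"tags": {"suspected_cause": "vendor_update"},
     "hosts": ["a", "b"], "status": "resolved"},
    {"tags": {"suspected_cause": "vendor_update"},
     "hosts": ["b", "c"], "status": "active"},
    {"tags": {"suspected_cause": "disk_full"},
     "hosts": ["d"], "status": "active"},
]

hosts = impacted_hosts(incidents, "suspected_cause", "vendor_update")
progress = recovery_progress(hosts)
```

Host `d` is excluded because its incident carries a different tag, which mirrors the goal of keeping unrelated incidents separate from the outage scope.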

Best practice 4: Manage the volume of created tickets

Outages involving external dependencies can generate a substantial increase in downstream ticket volume. As application timeouts, retries, or degradations occur unpredictably, every service dependency can generate its own alert and corresponding ticket.

Customers who effectively used alert correlation and workflow automation had the best experience. Workflow automation generates fewer, cleaner tickets from correlated incidents and promptly routes them to the right teams. Each ticket carries the context responders need about the nature of the incident, which speeds remediation and reduces the effort wasted chasing symptoms.
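A small Python sketch can show why ticketing from correlated incidents, rather than from raw alerts, reduces volume. Everything here is hypothetical: the routing table, the team names, the incident fields, and the `build_tickets` function are assumptions for illustration, not a real ticketing integration.

```python
# Hypothetical mapping from affected service to owning team.
ROUTING = {"dns": "network-team", "auth": "identity-team"}


def build_tickets(incidents, routing, default_team="service-desk"):
    """Create one ticket per correlated incident, with routing and context."""
    tickets = []
    for incident in incidents:
        service = incident["service"]
        tickets.append({
            "team": routing.get(service, default_team),
            "title": f"{service}: {len(incident['alerts'])} correlated alerts",
            "context": incident.get("suspected_cause", "unknown"),
        })
    return tickets


incidents = [
    {"service": "dns", "alerts": ["a1", "a2", "a3"],
     "suspected_cause": "external DNS provider outage"},
    {"service": "checkout", "alerts": ["a4"]},
]
tickets = build_tickets(incidents, ROUTING)
```

Four alerts produce two tickets here. Services without a routing entry fall back to a default queue so that nothing is dropped.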

The November 2025 Cloudflare DNS outage showed again that well-tuned correlation and automation help the entire operations process scale, even when external services create cascading internal noise.

Best practice 5: Perform post-event analysis

After outages, customers consistently use BigPanda to quantify impact and improve readiness. Teams use unified analytics and problem management to extract insights like:

Volume and timing of external-dependency incidents
Which internal services were most affected
Ticket load and response times
Time to detection vs. time to awareness

This reporting enables leaders to understand the business impact, prepare executive summaries, and strengthen response strategies for future outages.
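The metrics listed above can be computed from incident timestamps. The Python sketch below rests on assumed definitions that the article does not spell out: time to detection is taken as the gap between an incident starting and the first alert, and time to awareness as the gap between the incident starting and the team recognizing the cause as external. The field names and the `summarize` function are hypothetical.

```python
from statistics import mean


def summarize(incidents):
    """Summarize incident count and mean detection/awareness delays.

    Timestamps are in seconds. The definitions of "detection" and
    "awareness" used here are assumptions for this sketch.
    """
    time_to_detection = [i["detected_at"] - i["started_at"] for i in incidents]
    time_to_awareness = [i["aware_at"] - i["started_at"] for i in incidents]
    return {
        "incident_count": len(incidents),
        "mean_time_to_detection": mean(time_to_detection),
        "mean_time_to_awareness": mean(time_to_awareness),
    }


incidents = [
    {"started_at": 0, "detected_at": 60, "aware_at": 300},
    {"started_at": 100, "detected_at": 220, "aware_at": 700},
]
summary = summarize(incidents)
```

A large gap between the two means suggests that teams saw symptoms quickly but took longer to attribute them to an external provider.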

Final thoughts

Outages caused by external factors, such as software updates, DNS disruptions, CDN failures, or global internet issues, remind us that even the strongest internal infrastructure is only as resilient as the external ecosystem it relies on.

BigPanda customers consistently demonstrate remarkable professionalism, clarity, and creativity in their responses. Their best practices help them stay ahead of chaos, communicate confidently with the business, and restore services faster, especially when the root cause lies entirely outside their control. At BigPanda, we’re privileged to support teams that keep the digital world running.

If your organization wants better visibility into external dependencies and faster clarity when the next outage occurs, and it certainly will, BigPanda can help you achieve this. Let’s talk.
