Speed change-alert discovery and incident resolution

Today, the majority of organizations operate under a hybrid cloud structure. As a result, operations teams face daily infrastructure and software changes and updates, which are also the primary cause of incidents and outages. Long gone are the days when a tech stack could be represented by a single dependency model. Microservices, CI/CD, and containers across multi-cloud make it extremely difficult to track all the changes and connect them to incidents.

“Before BigPanda, there was a lot of manual effort to our process and very low visibility into what was happening across our broader teams,” shared Mark Peterson, SVP of IT Operations at Cambia Health Solutions, during BigPanda’s healthcare customer testimony webinar. “When we would experience an outage, it would be hours of manual intervention just to clean up the high influx of alerts that came in. We could have outages where we would receive a thousand alerts within a three-minute window, which would actually completely crash our webpage. So our NOC was completely blind during those incidents, and it negatively impacted our MTTR to rely on manual efforts.”

Legacy event management, monitoring, and modern observability tools are not configured to collect and analyze data from different change tools at the moment an incident occurs, so they cannot accurately and automatically identify probable root causes. Instead, IT teams using these tools typically analyze causality as part of a post-mortem following a high-priority outage or major incident. These analyses, also called Root Cause Analyses (RCA), determine after the fact what happened, why, and how to avoid similar disruptions in the future.

While post-incident investigations provide some value, they do not help in real time. On average, 85% of incident-impacting alerts are a direct result of changes within the IT environment, causing significant toil at the time of the incident. “Change-related incidents are one of the biggest generators of unnecessary alert noise,” agreed Peterson.

Without an accurate, real-time way to reach the root cause of an incident as it occurs, the bulk of incident-impacting alerts caused by changes will continue to drive prolonged manual toil and rising MTTR. BigPanda’s latest generation of Root Cause Changes offers an intelligent solution to combat these challenges.

High-confidence insights into statistically relevant change data

Our Root Cause Changes (RCC) feature uses advanced AI to automatically identify incident-impacting change data. Newly enhanced dimensions, categories, and confidence matching refine the RCC AI algorithm so that it surfaces far more relevant change tags across customer deployments. This refinement enables user-validated correlation of real-time change data within an incident, so users receive a high-confidence causality ranking of the suspected change alerts linked to that incident. The result is greater statistical precision and confidence, which makes incident triage faster and more efficient.
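
To make the idea of a confidence-ranked list of suspected changes more concrete, here is a minimal, purely illustrative sketch in Python. It is not BigPanda's RCC algorithm, which this post does not describe. The names `Change`, `Incident`, `score_change`, and `rank_changes`, the two-hour window, and the equal weighting of recency and host/service overlap are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    # Hypothetical change record, e.g. pulled from a CI/CD or ticketing tool.
    change_id: str
    host: str
    service: str
    deployed_at: datetime


@dataclass
class Incident:
    # Hypothetical incident: the hosts and services its alerts touch.
    hosts: set
    services: set
    started_at: datetime


def score_change(change, incident, window=timedelta(hours=2)):
    """Return a heuristic score in [0, 1]; 0 means 'not a candidate'."""
    lead = incident.started_at - change.deployed_at
    # Only changes deployed before the incident, inside the window, qualify.
    if lead < timedelta(0) or lead > window:
        return 0.0
    # Closer to the incident start means a higher recency value.
    recency = 1.0 - lead / window  # timedelta / timedelta yields a float
    # Half credit each for touching an affected host or an affected service.
    overlap = 0.5 * (change.host in incident.hosts) \
        + 0.5 * (change.service in incident.services)
    # Assumed weighting: recency and overlap count equally.
    return round(0.5 * recency + 0.5 * overlap, 3)


def rank_changes(changes, incident):
    """Return (score, change_id) pairs for candidates, highest score first."""
    scored = [(score_change(c, incident), c.change_id) for c in changes]
    return sorted((s for s in scored if s[0] > 0), reverse=True)
```

A production system would draw on far richer signals and on user feedback; the sketch only shows the general shape of scoring each recent change against an incident and sorting the results.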

Uncover change data linked to IT incidents in real time

Pragmatic AI correlates incident alerts with change data to identify the cause of an incident within seconds. This gives ITOps, DevOps, and SRE teams fast, precise root cause identification once an incident occurs, leading to an MTTR reduction of up to 50%. RCC automatically synthesizes this complex alert data into clear, crisp incident summaries written in natural language. When teams can understand the incident impact and interpret the probable root cause within moments, prolonged manual toil is eliminated and incidents are resolved faster.

Advanced analytics for improved environment insights at a glance

The new Unified Analytics RCC dashboard enables users to measure, improve, and operationalize root cause change investigation for all applications and services. The interactive dashboard displays change tag details, total alerts, and incidents, which helps teams optimize out-of-the-box RCC configurations and make operational improvements. This real-time visibility gives ITOps, L2, L3, and response teams reportable change information that both guides their actions during incident triage and provides insights for improving operational workflows.

Get to the root cause faster with Root Cause Changes

BigPanda Root Cause Changes is the solution for IT teams to correlate ITOps incidents with the changes that caused them. Customers are expressing interest in using these features as soon as possible, with Cambia’s Peterson sharing, “We’ll use BigPanda’s Root Cause Changes tool to gain a clearer understanding of the underlying changes causing incidents so we can respond more effectively.”

Discover more about Root Cause Changes and the importance of getting to the root cause quickly in our webinar, Real-time root cause analysis in the age of hybrid cloud complexity. You can also check out our e-book, Three Ways to Simplify Root Cause Discovery, or review our Root Cause Changes data sheet for more information.