How automated root-cause analysis can help reduce MTTR

Finding the root causes of IT anomalies can be challenging, but the rewards are worth it. By identifying the root cause or causes of an incident or critical failure, response teams can resolve incidents faster and determine the best steps to avoid having them recur. This can drive down both the frequency of service interruptions and their duration. However, there is an essential distinction between root cause investigation after an incident and the investigation that occurs while an incident is still active.

Move beyond retrospective root-cause analysis

Most people first experience root-cause analysis (RCA) as a part of an incident’s post-mortem. With retrospective RCA, teams review a specific high-priority outage to determine how to avoid another similar outage. Because the original incident is resolved, teams have the luxury of time not just to research what happened but to dig deeply into why it happened and how to improve processes and systems to mitigate future risk.

But when a costly service outage occurs, rapid response and remediation depend on finding the root cause quickly. Response teams must identify what happened, determine why it happened, and suggest a course of action, all within minutes. Fast response requires RCA in real time.

How real-time root-cause analysis reduces MTTR

Real-time RCA, meaning the identification of incident root causes in minutes or seconds, is essential to reducing incident mean time to resolution (MTTR). Changes, system failures, and other root causes are often easy to fix once identified, which shortens MTTR and avoids minutes or hours of costly downtime. Finding root causes quickly makes a real difference.
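To make the metric concrete, here is a minimal Python sketch of how MTTR and a percentage reduction in MTTR could be computed. It assumes MTTR is measured from detection to resolution and averaged across incidents; the function names and the sample timestamps are hypothetical and do not come from any BigPanda tool. The 25-hour and 5.5-hour figures are the ones a customer quotes in this article.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved_at - detected_at) over a list of incident pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def percent_reduction(before_hours, after_hours):
    """Percentage by which MTTR dropped from before_hours to after_hours."""
    return (before_hours - after_hours) * 100 / before_hours


# Hypothetical sample data: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0)),   # 2 hours
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 0)),  # 1 hour
    (datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 11, 0)),   # 3 hours
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 2:00:00

# Figures from the FreeWheel quote in this article: 25 hours down to 5.5 hours.
print(percent_reduction(25.0, 5.5))  # 78.0
```

Teams define the start and end of the MTTR clock differently (for example, from first alert rather than from ticket creation), so the definition should be fixed before comparing numbers over time.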

So, if it’s valuable to understand the root cause of an incident right away, why is it not common practice? The answer lies in how difficult and resource-intensive it can be. To find the root cause in real time, response teams must sift through event data from multiple observability tools: the average BigPanda customer uses 21 unique tools, each focused on one aspect of the infrastructure stack. Event data alone is not enough, however. Because most outages are related to changes, change data is a vital part of real-time RCA. Other data sources, such as historical incident data and topology data, can also be instrumental in determining root cause.

“BigPanda helps us detect incidents and uncover probable root cause in real-time, which has significantly reduced our MTTR by 78%, from 25 hours to 5.5 hours per incident.”

– Michael Lorenzo, Senior Director of Operations for the Global NOC, FreeWheel

At BigPanda, we’ve found that the first step in accelerating root-cause analysis and reducing MTTR is to organize incident data in a clear, understandable way and present that data to response teams quickly and efficiently. Tools like the BigPanda Incident Timeline, for example, allow human operators to instantly identify the first alert in a causal chain that resulted in an outage, tracing an outage back to the first symptom of a problem and the likely root cause. Nevertheless, because of the volume of data to analyze and the time human response teams need to analyze it manually, real-time root-cause investigation increasingly relies on artificial intelligence and machine learning to automate the process.
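The timeline idea can be illustrated with a short Python sketch: order an incident's alerts by start time and surface the earliest one as the first symptom. The `Alert` type, its fields, and the sample data are hypothetical and are not the BigPanda data model. The sketch also assumes the alerts have already been grouped into one incident.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape, for illustration only.
    source: str          # monitoring tool that raised the alert
    host: str
    message: str
    started_at: datetime


def build_timeline(alerts):
    """Return the alerts ordered from earliest to latest start time."""
    return sorted(alerts, key=lambda a: a.started_at)


def first_symptom(alerts):
    """Return the earliest alert, a candidate starting point for RCA."""
    return min(alerts, key=lambda a: a.started_at)


alerts = [
    Alert("apm", "web-03", "HTTP 5xx rate high", datetime(2024, 5, 1, 10, 7)),
    Alert("infra", "db-01", "Disk latency high", datetime(2024, 5, 1, 10, 1)),
    Alert("synthetic", "checkout", "Probe failed", datetime(2024, 5, 1, 10, 9)),
]

for alert in build_timeline(alerts):
    print(alert.started_at.time(), alert.source, alert.host, alert.message)

print("First symptom:", first_symptom(alerts).host)  # First symptom: db-01
```

The earliest alert is only a lead, not proof of causation. Clock skew between tools and differing polling intervals can reorder alerts that started close together.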

What is automated root-cause analysis?

Automated RCA is the use of AI/ML to investigate incident root causes in real time. BigPanda uses machine learning and artificial intelligence, including generative AI, to reveal causal relationships behind incidents with surprising speed and accuracy. AI-driven root-cause analysis depends, however, on access to a robust, enriched dataset. Our testing shows that AI/ML outputs are twice as accurate and reliable when the underlying algorithms have access to the clean, organized data available in the BigPanda system.

“With BigPanda, we’ve automated our alert process by 83%, enabling root cause identification of critical alerts within 30 seconds.”

– Mark Peterson, SPV IT Operations, Cambia Health Solutions

Finding root cause changes with BigPanda

Because most outages are the result of some change or changes, analyzing change data in the context of an incident can reveal underlying causality with a high degree of accuracy. BigPanda is pre-integrated with common change systems such as ServiceNow, JIRA, Jenkins, and CloudTrail, and also offers a Changes REST API. Through these integrations, it connects to change feeds and tools and aggregates their data for analysis. BigPanda then uses machine learning to correlate individual changes with incident data and suggest the changes most likely to be the root cause. This feature, called Root Cause Changes, is tunable to help response teams home in on the right causal factors quickly in a way that best suits their business.
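As a rough illustration of correlating changes with an incident, the Python sketch below ranks recent changes with a simple hand-written heuristic: changes that finished shortly before the incident began and that touched an affected service score highest. This is not BigPanda's machine learning model or its API. The `Change` and `Incident` types, the 24-hour window, and the 0.25 weight for unrelated services are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    # Hypothetical change record, for illustration only.
    change_id: str
    service: str
    completed_at: datetime


@dataclass(frozen=True)
class Incident:
    services: frozenset   # services affected by the incident
    started_at: datetime


def score_change(change, incident, window=timedelta(hours=24)):
    """Heuristic score in [0, 1]; 0 means the change is not a candidate."""
    lag = incident.started_at - change.completed_at
    if lag < timedelta(0) or lag > window:
        return 0.0  # finished after the incident began, or too long before it
    recency = 1.0 - lag / window  # closer to the incident start scores higher
    overlap = 1.0 if change.service in incident.services else 0.25
    return recency * overlap


def rank_changes(changes, incident, top_n=3):
    """Return up to top_n (score, change) pairs, highest score first."""
    scored = [(score_change(c, incident), c) for c in changes]
    candidates = [(s, c) for s, c in scored if s > 0]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[:top_n]


incident = Incident(frozenset({"checkout"}), datetime(2024, 5, 1, 12, 0))
changes = [
    Change("CHG-1", "checkout", datetime(2024, 5, 1, 11, 0)),
    Change("CHG-2", "search", datetime(2024, 5, 1, 11, 30)),
    Change("CHG-3", "checkout", datetime(2024, 5, 1, 13, 0)),   # after incident
    Change("CHG-4", "checkout", datetime(2024, 4, 29, 12, 0)),  # outside window
]

for score, change in rank_changes(changes, incident):
    print(f"{change.change_id} {change.service} score={score:.2f}")
```

In this sample, CHG-1 ranks first because it touched the affected service an hour before the incident, while CHG-3 and CHG-4 are excluded entirely. A production system would learn such weights from historical incidents rather than hard-coding them, and would likely use topology data to decide which services count as related.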

The results of applying AI to change data are evident. RCA can occur in real time or near real time, with results available in moments for teams to act on. One customer using the latest generation of Root Cause Changes (RCC) saw a dramatic decrease in the duration of incidents, cutting MTTR by 50% within the first two months of deployment.

How to automate root cause with the power of generative AI

Despite the hype surrounding generative AI, this emerging technology can provide significant value to teams automating RCA. When supplied with relevant data in context, generative AI and the underlying large language model (LLM) algorithms can successfully compare an individual incident’s data with a vast database of prior IT incidents used in their training.

BigPanda has focused its generative AI efforts on automating incident analysis, including summarizing each incident along with its impact and root cause. Our users report that AI-suggested root causes are more accurate than those of most human first responders. BigPanda Generative AI for Automated Incident Analysis has also resulted in much faster triage, with some organizations cutting triage time in half.

Get started with AI and automated root cause

With AI-suggested root causes proving more accurate than most human responders and significantly reducing MTTR, the future of RCA is firmly rooted in automation and artificial intelligence.

As we move towards this data-rich and fast-paced era of IT operations, the ability to swiftly and accurately pinpoint root causes isn’t just a competitive advantage; it’s a necessity. Discover why and how companies are transforming their root cause automation in our ebook Three Ways To Simplify Root Cause Discovery or our webinar, Real-time root cause discovery in the age of hybrid cloud complexity. And with BigPanda, you’re not just keeping up with the times; you’re shaping the future of incident analysis and resolution.

“If we are unable to identify and understand the root cause of an incident, we are at a tremendous disadvantage. With BigPanda, we are now taking advantage of machine learning automations and artificial intelligence to further decrease the mean time to identify an incident, which in turn gives us more time back to resolve the operational incident, reducing our MTTR and keeping our services running.”

– Alvin Smith, Vice President, Global Infrastructure and Operations, InterContinental Hotels Group (IHG)