Root Cause Analysis, powered by AIOps

Automated Root Cause Analysis to speed up IT incident resolution

Identifying the root cause of an outage or a poorly performing application is one of the biggest challenges that IT organizations face today.

In the past, outages were typically caused by low-level hardware and infrastructure issues, which were easy to identify. Today, with the proliferation of tools, problems have moved up the stack to complex application architectures, databases and the cloud. Modern IT environments can experience thousands of changes every week, and each change has the potential to cause unintended outages or disruptions.

Without an automated system to detect the root cause, IT teams have to go through the grueling and time-consuming process of manually sifting through hundreds of thousands of IT alerts and changes to find the answer.

There is good news. The days of manually sifting through alerts and changes to determine the cause of outages are over. BigPanda is purpose-built to root out the cause using artificial intelligence for IT operations (AIOps).

In a November 2019 market report, Gartner Research estimated the AIOps platform market at $300 million to $500 million a year. They predicted that by 2023, 40 percent of DevOps teams would add AIOps platforms to their toolset.

Root Cause Analysis

BigPanda’s Root Cause Analysis, built for modern, complex IT environments

BigPanda’s Root Cause Analysis (RCA) capability uses Open Box Machine Learning to help organizations identify changes in infrastructure and applications that cause the majority of today’s incidents and outages. Additionally, BigPanda identifies low-level infrastructure issues that cause problems.

By harnessing the power of AIOps, BigPanda is able to pinpoint the root cause of incidents and outages in real time. Because IT Ops can then resolve incidents and outages rapidly, the result is a tangible impact on both revenue and SLAs.

The challenges of manual root cause analysis

Complex infrastructure + legacy tools + pace of change

Today’s IT environment is only becoming more complicated and multi-layered. Systems fail. Developers are stressed. Customers are upset. Revenue is lost.

  • The IT stack has become incredibly diverse and convoluted due to the proliferation of tools, making traditional dependency-driven RCA ineffective.
  • CMDBs and their once-revered dependency trees no longer hold the answer, as today’s modern environments contain a plethora of microservices, communicating with a variety of data stores such as MongoDB, PostgreSQL, Snowflake and S3, and running on elastic clouds and container clusters.
  • This new environment has also created an order-of-magnitude increase in the pace of change, thanks to practices such as continuous delivery and Infrastructure as Code (IaC), with organizations experiencing thousands of changes every week.

Finding root cause is complex

Root cause analysis often requires piecing together information from multiple sources like log management, APM or tracing. This process requires operators to navigate between different tools while trying to identify relevant information such as runbook links. In many cases, operators are not even aware that such information exists in other tools.

No real-time topology models

Without an up-to-date, full-stack, real-time topology model, organizations are not able to correlate their incidents and identify the probable root cause of those incidents with a high degree of accuracy. This limits their ability to rapidly investigate and resolve incidents before they escalate into crippling outages.

Detecting patterns across alert storms is hard

When an incident starts to form, it often shows up as tens or even hundreds of alerts generated around the same time. Without inspecting each individual alert, it is difficult to find the common denominator of those alerts, which is often the root cause of the issue. As a result, teams waste a lot of time investigating the incident.

Historical context is missing

There is often a distinct order of events that leads to an incident or outage. When alerts are simply aggregated by the window of time in which they were received, those clues get lost, making it nearly impossible for IT Operations teams to visualize the progress of an incident over time. This lack of historical context makes it difficult to prevent similar incidents in the future.

With BigPanda Root Cause Analysis, operations teams can eliminate long bridge calls that tie up high-value experts. This frees those experts up to work on strategic initiatives for the organization.

How Automated Root Cause Analysis works

Root Cause Analysis surfaces the problem change right alongside the incident

Once integrated with all change feeds and tools, BigPanda aggregates change data (new changes and updates to existing changes) and normalizes it. Then, BigPanda’s Open Box Machine Learning technology analyzes these changes against existing incidents in real time, to identify root cause changes and surface them alongside the incidents they relate to.
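
As a rough illustration of the matching step described above, the following Python sketch normalizes change records from two hypothetical feeds into one shape and then ranks them against an incident by shared host and time proximity. The feed formats, field names and scoring rule are all assumptions made for this example. It is a simple heuristic, not BigPanda’s Open Box Machine Learning.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    change_id: str
    host: str
    started_at: datetime


def normalize(record: dict, source: str) -> Change:
    """Map a raw record from a hypothetical feed onto the common Change shape."""
    if source == "ticketing":  # assumed shape: {"number", "ci", "start"}
        return Change(record["number"], record["ci"].lower(),
                      datetime.fromisoformat(record["start"]))
    if source == "cicd":  # assumed shape: {"deploy_id", "target", "ts"}
        return Change(record["deploy_id"], record["target"].lower(),
                      datetime.fromisoformat(record["ts"]))
    raise ValueError(f"unknown change source: {source}")


def suspect_changes(changes, incident_hosts, incident_start,
                    window=timedelta(hours=2)):
    """Return changes on affected hosts that began within `window` before the
    incident, most recent first (a simple heuristic, not a learned model)."""
    hosts = {h.lower() for h in incident_hosts}
    hits = [c for c in changes
            if c.host in hosts
            and timedelta(0) <= incident_start - c.started_at <= window]
    return sorted(hits, key=lambda c: c.started_at, reverse=True)


# Hypothetical change records from two different tools.
changes = [
    normalize({"number": "CHG001", "ci": "DB-01",
               "start": "2024-01-10T08:30:00"}, "ticketing"),
    normalize({"deploy_id": "dep-42", "target": "web-01",
               "ts": "2024-01-10T09:40:00"}, "cicd"),
    normalize({"deploy_id": "dep-43", "target": "web-02",
               "ts": "2024-01-10T09:50:00"}, "cicd"),
]
incident_start = datetime(2024, 1, 10, 10, 0)
suspects = suspect_changes(changes, ["web-01", "db-01"], incident_start)
print([c.change_id for c in suspects])  # ['dep-42', 'CHG001']
```

Normalizing first is what lets a single ranking function compare a change ticket and a deployment event, even though the two tools describe them with different fields.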

Topology-based Root Cause Analysis increases accuracy of finding the probable cause

BigPanda’s Real-Time Topology Mesh creates a full-stack, real-time topology model that captures dependencies between networks, servers, clouds and applications. BigPanda’s Open Box Machine Learning technology then correlates monitoring alerts against this topology model and surfaces the probable root cause of incidents with a high degree of accuracy.
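
To make the idea of correlating alerts against a topology model more concrete, here is a minimal Python sketch. It assumes the topology is available as a simple dependency map (a hypothetical structure invented for this example) and treats an alerting node as a probable root cause when nothing it depends on is also alerting. This is an illustrative heuristic only and does not represent how the Real-Time Topology Mesh is implemented.

```python
from collections import deque

# Hypothetical topology: each node maps to the nodes it directly depends on.
DEPENDS_ON = {
    "checkout-app": ["payments-api", "web-lb"],
    "payments-api": ["orders-db"],
    "web-lb": ["core-switch"],
    "orders-db": ["core-switch"],
    "core-switch": [],
}


def upstream(node, graph):
    """Return every node that `node` depends on, directly or transitively."""
    seen, queue = set(), deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(graph.get(current, []))
    return seen


def probable_root_causes(alerting, graph):
    """Alerting nodes with no alerting dependency of their own: the deepest
    failing points in the topology, and so the likeliest places to start."""
    alerting = set(alerting)
    return sorted(n for n in alerting if not (upstream(n, graph) & alerting))


roots = probable_root_causes(
    ["checkout-app", "payments-api", "orders-db"], DEPENDS_ON)
print(roots)  # ['orders-db']
```

In this example the application and API alerts are treated as symptoms, because both sit downstream of the alerting database.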

Dynamic incident titles display probable root cause at a glance

BigPanda surfaces the common denominator of an incident’s alerts, which is often the root cause of the incident or outage, in real time, and displays it as the probable root cause within the incident title. As new alerts are collected and added to the incident, BigPanda dynamically updates the title. With dynamic incident titles, operations teams always have access to the most up-to-date probable root cause.
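
One simple way to picture a "common denominator" is as the set of attributes that every alert in an incident shares. The Python sketch below assumes alerts are flat dictionaries of string tags (an assumption made for this example) and rebuilds a title from the shared tags whenever the alert list changes. The title format is invented here and is not BigPanda’s.

```python
def common_tags(alerts):
    """Return the tag/value pairs shared by every alert in the incident."""
    if not alerts:
        return {}
    shared = set(alerts[0].items())
    for alert in alerts[1:]:
        shared &= set(alert.items())
    return dict(shared)


def incident_title(alerts):
    """Build a title from the shared tags; call again after each new alert."""
    shared = common_tags(alerts)
    if not shared:
        return f"{len(alerts)} alerts with no shared attribute"
    summary = ", ".join(f"{k}={v}" for k, v in sorted(shared.items()))
    return f"{len(alerts)} alerts sharing {summary}"


# Hypothetical alerts arriving over time.
alerts = [
    {"host": "db-01", "cluster": "prod-east", "check": "disk"},
    {"host": "db-01", "cluster": "prod-east", "check": "latency"},
]
first_title = incident_title(alerts)
print(first_title)  # 2 alerts sharing cluster=prod-east, host=db-01

alerts.append({"host": "web-01", "cluster": "prod-east", "check": "cpu"})
second_title = incident_title(alerts)
print(second_title)  # 3 alerts sharing cluster=prod-east
```

The title narrows from a single host to the whole cluster once an alert from a second host joins the incident, which mirrors the dynamic update described above.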

Incident timelines show the evolution of an incident over time

To help operations teams understand when an incident started and how it evolved, BigPanda’s Incident 360 Console provides an Incident Timeline view. The Incident Timeline shows when each alert associated with the incident occurred and in what order, so users can trace the probable root cause more quickly and resolve the incident faster.
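
The ordering idea behind a timeline can be sketched in a few lines of Python. The alert fields and names below are hypothetical, and the code only shows the principle: sort by start time rather than arrival order, and express each alert as an offset from the first one.

```python
from datetime import datetime


def incident_timeline(alerts):
    """Order alerts by start time and pair each with its offset, in seconds,
    from the first alert in the incident."""
    if not alerts:
        return []
    ordered = sorted(alerts, key=lambda a: a["started_at"])
    first = ordered[0]["started_at"]
    return [(int((a["started_at"] - first).total_seconds()), a["name"])
            for a in ordered]


# Hypothetical alerts, listed in the order they happened to be received.
alerts = [
    {"name": "checkout errors", "started_at": datetime(2024, 1, 10, 10, 2, 0)},
    {"name": "db-01 disk latency", "started_at": datetime(2024, 1, 10, 10, 0, 0)},
    {"name": "payments-api timeouts", "started_at": datetime(2024, 1, 10, 10, 0, 45)},
]
timeline = incident_timeline(alerts)
for offset, name in timeline:
    print(f"+{offset:>4}s  {name}")
```

Read top to bottom, the sorted view shows the database alert preceding the API and application alerts, a clue that is lost when alerts are viewed in arrival order.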

Deep link feature displays Root Cause Analysis insights in any tool or dashboard

BigPanda was designed to provide easy access to Root Cause Analysis insights from other domain-specific tools using the Deep Links feature. The Deep Links feature turns BigPanda into an intelligent gateway for operational context and can link to metadata, including root cause information collected from other systems or tools. With deep links, relevant dashboards in other monitoring tools, related searches in log management tools, and related runbook articles in knowledge bases are just one click away. This boosts Level 1 resolution rates and slashes mean time to repair.
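
Conceptually, a deep link is a URL template filled in with an alert’s metadata. The Python sketch below illustrates that idea using placeholder URLs on example.com and hypothetical alert fields. It does not reflect BigPanda’s actual Deep Links configuration format.

```python
from urllib.parse import quote

# Hypothetical link templates; real tools define their own URL formats.
LINK_TEMPLATES = {
    "Dashboard": "https://metrics.example.com/d/host-overview?var-host={host}",
    "Log search": "https://logs.example.com/search?q={query}",
    "Runbook": "https://wiki.example.com/runbooks/{check}",
}


def deep_links(alert):
    """Fill each template with URL-encoded values taken from the alert."""
    values = {
        "host": quote(alert["host"], safe=""),
        "check": quote(alert["check"], safe=""),
        "query": quote(f'host:{alert["host"]} AND level:error', safe=""),
    }
    return {label: template.format(**values)
            for label, template in LINK_TEMPLATES.items()}


links = deep_links({"host": "web-01", "check": "disk_full"})
for label, url in links.items():
    print(f"{label}: {url}")
```

Encoding the values before substitution keeps characters such as spaces and colons from breaking the generated URLs.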

Find out how BigPanda’s Root Cause Analysis supports modern IT Operations.

Building the business case for Root Cause Analysis powered by AIOps

To build a business case, quantify the negative consequences of having to manually determine the root cause of incidents. Here are some questions enterprises commonly use to quantify the status quo:

What is the value of each minute spent on your root cause analysis when you don’t have automated RCA in place?

  • How many FTEs does your IT Ops or NOC team have (across all the shifts, and across your global locations)?
  • What are their salaries and fully loaded costs?

Quantify the incidents and outages you experience daily or weekly across the P3, P2, P1 and P0 (or Sev3, Sev2, Sev1 and Sev0) categories.

  • How much time does your team spend investigating each of these outages to find the root cause?
  • What does that cost?
  • If your team holds bridge calls to investigate these incidents, how many people are on these calls?
  • How long do they last?
  • What is the cumulative cost of the tens of thousands of working hours of bridge calls annually?

What are the non-human costs of these outages to critical business systems, such as:

  • Revenue-generating systems that are not generating revenue
  • Point of sale (POS) systems
  • Payment processing services
  • SLAs that are violated
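
The staffing and bridge-call questions above reduce to simple arithmetic once you have your own numbers. The Python sketch below shows one way to total the annual people-cost of manual investigation. Every input value is a placeholder chosen for illustration, not a benchmark, and the non-human costs listed above would need to be added separately.

```python
def annual_investigation_cost(incidents_per_week, hours_per_incident,
                              people_per_call, loaded_hourly_rate):
    """Return (person-hours, cost) spent per year on manual investigation."""
    person_hours = (incidents_per_week * 52
                    * hours_per_incident * people_per_call)
    return person_hours, person_hours * loaded_hourly_rate


# Placeholder inputs; substitute figures from your own organization.
hours, cost = annual_investigation_cost(
    incidents_per_week=20,     # incidents that trigger a bridge call
    hours_per_incident=1.5,    # average length of each call
    people_per_call=8,         # average attendees
    loaded_hourly_rate=75.0,   # fully loaded cost per person-hour
)
print(f"{hours:,.0f} person-hours, ${cost:,.0f} per year")
# 12,480 person-hours, $936,000 per year
```

Running the same calculation per severity level (P0 through P3) shows where investigation time is concentrated.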

So, what are you waiting for?