Root Cause Analysis, powered by AIOps

How real-time Root Cause Analysis speeds up incident investigation

Identifying the root cause of an outage or poorly performing application is one of the biggest challenges for enterprise IT Ops, NOC, DevOps and SRE teams.

That’s because low-level hardware and infrastructure issues that caused outages in the past are no longer the main problem. Problems have now moved up the stack, to complex application architectures, databases, clouds and their inter-dependencies. And because these modern IT environments experience thousands of changes every week, each change has the ability to cause an unintended outage or disruption.


Without Root Cause Analysis (RCA) techniques that are built for these modern IT environments, teams must go on a scavenger hunt, manually and slowly sifting through hundreds of thousands of IT alerts and thousands of changes to triangulate on the root cause.

BigPanda’s Root Cause Analysis, built for modern, complex IT environments

BigPanda’s Root Cause Analysis capability uses Open Box Machine Learning to help organizations identify changes in infrastructure and applications that cause the majority of today’s incidents and outages.

In addition, BigPanda identifies the low-level infrastructure issues that still cause a share of today’s problems.

By pinpointing the root cause of incidents and outages in real-time, BigPanda helps organizations and their operations teams rapidly investigate and resolve those incidents and outages.

The challenges with finding root cause today

Complex infrastructure + legacy tools + pace of change

In this era of fast-moving IT, the root cause of incidents and outages has moved “up the stack,” from infrastructure problems (such as misconfigured routers, faulty power supplies and corrupt storage arrays) to applications, databases, cloud dependencies and users.

  • This layer of the stack has become incredibly diverse and convoluted, making traditional dependency-driven RCA ineffective.
  • CMDBs and their once-revered dependency trees no longer hold the answer, as today’s environments contain a plethora of microservices communicating with a variety of databases and data stores (such as MongoDB, PostgreSQL, Snowflake and S3) and running on elastic clouds and container clusters.
  • This new environment also brings an order-of-magnitude increase in the pace of change, thanks to practices such as continuous delivery and Infrastructure as Code (IaC), with organizations experiencing thousands of changes every week.

Finding root cause is complex

Root cause analysis often requires piecing together information from multiple sources (e.g. log management, APM, tracing, etc.). This requires operators to inefficiently navigate between different tools, trying to identify relevant information such as runbook links. In many cases, operators are not even aware that such information exists in other tools.


No real-time topology models

Without an up-to-date, full-stack, real-time topology model, organizations are not able to correlate their incidents and identify the probable root cause of those incidents with a high degree of accuracy. This limits their ability to rapidly investigate and resolve incidents before they escalate into crippling outages.

Changes occurring at break-neck speed

When infrastructure is dynamic, changes arrive at break-neck speed: continuous delivery and Infrastructure as Code mean thousands of changes every week, and changes are the number one cause of incidents and outages. Yet there is no single source of truth for changes, and they are rarely correlated against incidents, so teams waste a lot of time working out which change broke what.


Historical context is missing

There is often a very distinct order of events, which can get lost by simply aggregating alerts received in a specific window of time. Because it is nearly impossible for IT Operations teams to easily visualize the progress of an incident over time, it is hard for them to understand when an incident began and how it evolved.

With BigPanda Root Cause Analysis, operations teams can eliminate long-lasting bridge calls that tie up high-value experts. This frees those experts to work on more strategic initiatives for the organization.


How real-time Root Cause Analysis speeds up incident response

Root Cause Changes (aka change-based RCA)


Organizations with dynamic infrastructure (cloud-native and hybrid IT stacks) experience more changes than ever before, and changes are the number one cause of incidents and outages. But there is no single source of truth for changes, because many changes are not tracked in change management tools; instead, they are tracked in continuous integration/continuous delivery (CI/CD) pipelines and other tools. Because of that, and because these changes are not correlated against incidents, it is very hard for customers to figure out which change caused which incident.

Topology-based RCA


Without an up-to-date, full-stack, real-time topology model, organizations are not able to correlate their incidents and identify the probable root cause of those incidents with a high degree of accuracy. This limits their ability to rapidly investigate and resolve incidents before they escalate into crippling outages.

Dynamic Incident Titles


When an incident starts to form, it is often manifested in tens or even hundreds of alerts that are generated around the same time. But it is often difficult to figure out the common denominator of these alerts (e.g. application, database, switch, storage array, etc.) without inspecting each individual alert. Because this common denominator is often the root cause of the issue, teams waste a lot of time investigating the incident.

Visualizing Incident Evolution


IT incidents usually materialize in the form of many symptoms across monitoring systems, and there’s often a very distinct order of events. That’s why simply aggregating alerts received in a specific window of time is not enough. Because it’s difficult for IT Operations teams to easily visualize the progress of an incident over time, it’s hard for them to understand when an incident began, and how it evolved over time.

Key IT Operations information is scattered across tools


Root cause analysis often requires piecing together information from multiple sources (e.g. log management, APM, tracing, etc.). This requires operators to inefficiently navigate between different tools, trying to identify relevant information such as runbook links. In many cases, operators are not even aware that such information exists in other tools.

How BigPanda Root Cause Analysis works


Root Cause Analysis surfaces the problem change right alongside the incident

Once integrated with all change feeds and tools, BigPanda aggregates change data (new changes and updates to existing changes) and normalizes it. Then, BigPanda’s Open Box Machine Learning technology analyzes these changes against existing incidents in real time to identify and surface root cause changes alongside the affected incident.
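
To make the idea concrete, here is a minimal, purely illustrative Python sketch of change-to-incident scoring, in which recent changes that touch the same service as an incident’s alerts rank as more suspect. The field names, the two-hour window and the scoring weights are assumptions made for this example, not BigPanda’s actual schema or algorithm.

    # Illustrative only: rank normalized change records against an incident
    # by time proximity and shared service tags.
    from datetime import datetime, timedelta

    def score_change(change: dict, incident: dict) -> float:
        """Naive suspicion score: recent changes touching an alerting service score higher."""
        score = 0.0
        delta = incident["start"] - change["timestamp"]
        if timedelta(0) <= delta <= timedelta(hours=2):
            # Changes made shortly before the incident started are more suspect.
            score += 1.0 - delta / timedelta(hours=2)
        alerting_services = {a.get("service") for a in incident["alerts"]}
        if change.get("service") in alerting_services:
            score += 1.0  # the change touched a service that is now alerting
        return score

    incident = {
        "start": datetime(2024, 5, 1, 10, 30),
        "alerts": [{"service": "checkout-api"}, {"service": "payments-db"}],
    }
    changes = [
        {"id": "CHG-101", "service": "checkout-api",
         "timestamp": datetime(2024, 5, 1, 10, 5)},
        {"id": "CHG-102", "service": "billing-batch",
         "timestamp": datetime(2024, 4, 30, 22, 0)},
    ]
    ranked = sorted(changes, key=lambda c: score_change(c, incident), reverse=True)
    print(ranked[0]["id"])  # CHG-101 surfaces as the likely root cause change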

Topology-based Root Cause Analysis increases accuracy of finding the probable cause

BigPanda’s Real-Time Topology Mesh creates a full-stack, real-time topology model that captures dependencies between networks, servers, clouds and applications. BigPanda’s Open Box Machine Learning technology then correlates monitoring alerts against this topology model and surfaces the probable root cause of incidents with a high degree of accuracy.
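
As a rough illustration of the underlying idea, and not of BigPanda’s actual implementation, the Python sketch below walks a small, made-up dependency graph and ranks the dependencies shared by the alerting services as probable root cause candidates.

    # Illustrative only: find shared dependencies of the alerting services.
    from collections import Counter

    DEPENDS_ON = {  # hypothetical service -> dependencies map
        "checkout-api": ["payments-db", "auth-service"],
        "orders-api":   ["payments-db", "inventory-db"],
        "auth-service": ["users-db"],
    }

    def transitive_dependencies(node, graph):
        """Return every dependency reachable from a node."""
        seen, stack = set(), list(graph.get(node, []))
        while stack:
            dep = stack.pop()
            if dep not in seen:
                seen.add(dep)
                stack.extend(graph.get(dep, []))
        return seen

    def probable_root_causes(alerting_nodes, graph):
        """Rank dependencies by how many alerting services sit on top of them."""
        counts = Counter()
        for node in alerting_nodes:
            counts.update(transitive_dependencies(node, graph))
        return counts.most_common()

    print(probable_root_causes({"checkout-api", "orders-api"}, DEPENDS_ON))
    # [('payments-db', 2), ...] -- the shared database is the top candidate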


Dynamic incident titles display probable root cause at a glance

BigPanda surfaces the common denominator of an incident, which is often the root cause of incidents and outages, in real time, and displays the probable root cause within the incident’s title. As new alerts are collected and added to the incident, BigPanda dynamically updates the title, so operations teams always see the latest probable root cause.
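
The concept can be sketched in a few lines of Python: derive a title from the most common tag value across an incident’s alerts, and re-derive it whenever a new alert arrives. The tag names and the title format below are assumptions made for this example, not BigPanda’s actual behavior.

    # Illustrative only: title an incident by its alerts' most common tag value.
    from collections import Counter

    def incident_title(alerts, tag="host"):
        values = [a[tag] for a in alerts if tag in a]
        if not values:
            return "New incident"
        value, count = Counter(values).most_common(1)[0]
        return f"{value}: {count} of {len(alerts)} alerts"

    alerts = [
        {"host": "db-3", "check": "cpu_load"},
        {"host": "db-3", "check": "replication_lag"},
        {"host": "web-1", "check": "http_5xx"},
    ]
    print(incident_title(alerts))           # "db-3: 2 of 3 alerts"
    alerts.append({"host": "db-3", "check": "disk_io"})
    print(incident_title(alerts))           # the title updates as alerts arrive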

Incident timelines show the evolution of an incident over time

To help operations teams understand when an incident started and how it evolved, BigPanda’s Incident 360 Console provides an Incident Timeline view. The Incident Timeline shows when each alert associated with the incident occurred and in what order, making it easy to visualize an incident’s evolution over time, trace the probable root cause more quickly and resolve the incident faster.
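
As a simple illustration of the timeline idea, with made-up alerts and timestamps, sorting an incident’s alerts by start time already hints at where the incident began.

    # Illustrative only: render a plain-text incident timeline.
    from datetime import datetime

    alerts = [
        {"time": datetime(2024, 5, 1, 10, 41), "source": "checkout-api", "msg": "latency > 2s"},
        {"time": datetime(2024, 5, 1, 10, 32), "source": "payments-db",  "msg": "replication lag"},
        {"time": datetime(2024, 5, 1, 10, 30), "source": "payments-db",  "msg": "disk I/O saturated"},
    ]

    for alert in sorted(alerts, key=lambda a: a["time"]):
        print(f'{alert["time"]:%H:%M}  {alert["source"]:<14} {alert["msg"]}')
    # The earliest alerts (payments-db) point at where the incident likely began.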

Deep link feature displays Root Cause Analysis insights in any tool or dashboard

BigPanda was designed to provide easy access to Root Cause Analysis insights from other domain-specific tools using the Deep Links feature. Deep Links turn BigPanda into an intelligent gateway for operational context and can link to metadata, including root cause information collected from other systems and tools. With deep links, relevant dashboards in other monitoring tools, related searches in log management tools, or related runbook articles in knowledge bases are just one click away. This boosts Level 1 resolution rates and slashes mean time to repair.
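
One common way to implement this kind of deep linking, shown here purely as an illustration, is to expand URL templates with alert metadata. The tools, URLs and fields below are hypothetical and are not BigPanda’s configuration.

    # Illustrative only: build deep links into other tools from alert metadata.
    from urllib.parse import quote_plus

    LINK_TEMPLATES = {
        "Logs":    "https://logs.example.com/search?q={query}",
        "Runbook": "https://wiki.example.com/runbooks/{service}",
        "APM":     "https://apm.example.com/services/{service}/overview",
    }

    def deep_links(alert: dict) -> dict:
        service = quote_plus(alert["service"])
        query = quote_plus(f'service:{alert["service"]} AND severity:{alert["severity"]}')
        return {name: url.format(service=service, query=query)
                for name, url in LINK_TEMPLATES.items()}

    for name, url in deep_links({"service": "checkout-api", "severity": "critical"}).items():
        print(f"{name}: {url}")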

Find out how BigPanda’s Root Cause Analysis supports modern IT Operations.

Building the business case for BigPanda Root Cause Analysis

To build a business case, quantify the manual effort and time lost in trying to understand the root cause of incidents. Here are some areas enterprises should examine and quantify in their environment; a simple worked example of the math follows the list.

  • How many FTEs does your IT Ops or NOC team have (across all shifts and all of your global locations)? What are their salaries and fully loaded costs? This information helps you understand the value of each minute they spend on IT Ops alerts, incidents and outages.
  • How many incidents and outages do you experience daily or weekly across the P3, P2, P1 and P0 (or Sev3, Sev2, Sev1 and Sev0) categories? How much time does your team spend investigating each of them to find the root cause, and what does that cost? If your team holds bridge calls to investigate these incidents, how many people join those calls and how long do they last? What is the cumulative cost of the tens of thousands of working hours spent on bridge calls annually?
  • What is the non-human cost of these outages: critical business systems that suffer downtime, revenue-generating systems that stop generating revenue, point of sale (POS) systems that are down, payment processing services that cannot process payments, SLAs that are violated and result in penalties, and so on?
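
As a simple worked example of the bridge-call math alone, with every figure below a placeholder to replace with your organization’s own numbers:

    # Illustrative back-of-the-envelope math only; all figures are placeholders.
    fully_loaded_cost_per_fte = 120_000          # USD per year
    working_hours_per_year = 2_080
    hourly_cost = fully_loaded_cost_per_fte / working_hours_per_year   # ~$57.70

    bridge_calls_per_week = 10
    people_per_call = 8
    hours_per_call = 2.5
    bridge_hours_per_year = bridge_calls_per_week * 52 * people_per_call * hours_per_call

    annual_bridge_call_cost = bridge_hours_per_year * hourly_cost
    print(f"{bridge_hours_per_year:,.0f} person-hours ~= ${annual_bridge_call_cost:,.0f} per year")
    # 10,400 person-hours ~= $600,000 per year, before counting downtime or SLA penalties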

So, what are you waiting for?