Root Cause Changes: Real Examples of Modern Root Cause Analysis from our Beta Customers

November 21st, 2019

Root Cause Analysis (RCA) is an all-encompassing process.

It is usually complicated and often requires many people with different skills, all working the incident to determine what happened, when, why, how and, ultimately, who (to blame).

There is, however, a secret sauce today that can help solve many issues before a “full-scale” RCA process is initiated: Root Cause Changes (RCC). Since the majority of issues and outages today are caused by changes to infrastructure and software (over 80% according to Gartner), in theory you only need to find out what changed in order to resolve the incident quickly.

The problem is that finding these changes is not easy. In today’s fast-moving IT environments, the number and frequency of changes have skyrocketed. Some of our customers report over 4000 changes every week. These changes cover almost everything we do: deployments, software upgrades, configuration changes, scaling and more. They are recorded across many different tools, so visibility is a problem. To make matters worse, many of these changes happen either automatically or by mistake, without anyone knowing about them.

That is why BigPanda’s new Root Cause Changes is such an important feature. BigPanda is the only AIOps solution that automatically analyzes the information collected from all the change tools, including CI/CD tools, and correlates it with all the monitoring alerts collected, to quickly identify root cause changes and enable immediate resolution.

Here are some examples of our beta customers’ experiences with Root Cause Changes (host and service names have been changed for privacy):

Simple Security Patch Leads to Complicated Performance Issues
The first example is a classic change-related incident: installing a security patch causes a performance problem for end-users.

The incident starts at 6:54 pm with alerts related to “a high count of error logs” in MS Exchange, coming in from Datadog.

BigPanda surfaces the root cause change: the installation, at 5:24 pm, of a required security hotfix related to a recent Wintel vulnerability.

The incident was resolved by restarting the application pools. 

Firewall Changes Cause E-Commerce Interruptions
At 6:27 pm, the NOC receives an alert from AppDynamics about slow end-user performance on an e-commerce kiosk solution.

BigPanda surfaces the root cause change: at 5:21 pm on the same day, two firewall zones were migrated. During the change, some queues were stopped to route the traffic, but they were not restarted when the change was completed.

Shipping Module Update Crashes Mobile App
At 4:05 pm, the NOC receives alerts associated with crashes on the mobile app and multiple function executions.

BigPanda surfaces the root cause change: at 3:52 pm, AWS Lambda functions were removed from one of the mobile application gateways to clear up IP space.

Rolling back this change solved the problem.
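
In all three incidents above, a change landed 13 to 90 minutes before the first alert. That pattern can be sketched in a few lines of Python. This is only an illustration of simple time-window matching, not BigPanda’s actual correlation logic, which this post does not describe. The `Change` and `Incident` types, the `candidate_changes` function, the four-hour lookback, the target names and the calendar date are all assumptions made for the example; only the clock times come from the incidents above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    summary: str
    applied_at: datetime
    targets: frozenset  # hypothetical names of the systems the change touched


@dataclass(frozen=True)
class Incident:
    title: str
    started_at: datetime
    affected: frozenset  # hypothetical names of the systems raising alerts


def candidate_changes(incident, changes, lookback=timedelta(hours=4)):
    """Return changes applied within `lookback` before the incident that
    share at least one target with it, closest in time first."""
    window_start = incident.started_at - lookback
    hits = [
        c for c in changes
        if window_start <= c.applied_at <= incident.started_at
        and c.targets & incident.affected
    ]
    return sorted(hits, key=lambda c: incident.started_at - c.applied_at)


# Placeholder date: the post gives clock times only.
DAY = datetime(2019, 1, 1)


def at(hour, minute):
    return DAY.replace(hour=hour, minute=minute)


changes = [
    Change("Security hotfix installed", at(17, 24), frozenset({"exchange"})),
    Change("Firewall zones migrated", at(17, 21), frozenset({"kiosk"})),
    Change("Lambda functions removed", at(15, 52), frozenset({"mobile-app"})),
]
incident = Incident("High count of error logs", at(18, 54), frozenset({"exchange"}))

result = candidate_changes(incident, changes)
for change in result:
    # Prints: Security hotfix installed 1:30:00
    print(change.summary, incident.started_at - change.applied_at)
```

With thousands of changes a week, a time window and exact target match alone would return too many or too few candidates, so a real system would need richer signals than this sketch uses.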

With BigPanda’s Root Cause Changes feature, it’s easy to view all the changes related to an incident and identify its Root Cause Change by simply clicking on the “Related Changes” tab when looking at an incident.

It’s that simple.

Want to learn more about Root Cause Changes and what BigPanda can do for your IT environment, and your enterprise? Schedule a demo now to get started.

About the Author:

Haim Snir
Haim is a Senior Product Manager at BigPanda. He has over 15 years of experience in the IT Operations space and has worked closely with a number of F1000 customers. Before joining BigPanda, Haim founded and ran his own company, an IoT monitoring solution for the enterprise market.