The “holy grail” of IT incident management is determining the root cause of a major incident, and doing it quickly. Every extra minute the Service Operations team spends on Root Cause Analysis means lost revenue, angry customers, or worse.
When the storm hits, it’s best to have a plan. Whether your enterprise follows ITIL Problem Management guidelines or some other process, it’s good to have a robust IT Event Management platform to consult for answers. Preferably one that’s intelligent in how it processes and analyzes event data.
Any Root Cause Analysis method should emphasize trending and analysis of incident data tied to alerts and failed changes, so that the true source of a service disruption, large or small, can be determined. The first priority is always to restore the service; after that, use event data to understand what went wrong and to pinpoint the source of the incident or failed change.
Event management platforms with machine learning capabilities use algorithms to recognize incident patterns over time, so that similar incidents can hopefully be avoided in the future. This is particularly important functionality for enterprises that have adopted ITIL, which stresses not only Root Cause Analysis but also trend analysis to target and fix chronic systemic issues.
How the BigPanda Platform supports Root Cause Analysis
“How can you help us with Root Cause Analysis?” is the question we are asked most often by the Fortune 1000 enterprises we serve, probably because efficient RCA has such a substantial positive impact on service availability and reliability.
The good news is BigPanda provides meaningful help in identifying root cause. Incident data ingested by our algorithmic engine can be analyzed to significantly accelerate remediation during those “critical moments”.
Here are six methods of using BigPanda’s IT event data for Root Cause Analysis (RCA):
Method #1 – Correlation
BigPanda automatically correlates alerts belonging to the same outage. This provides meaningful context for investigation. Here are a couple of examples of how it works:
- Example #1: an application latency alert is correlated to a database load issue. The effect is directly correlated to the incident’s cause.
- Example #2: hundreds of network issues resulting from a DDoS attack are correlated together. Instead of chasing down multiple alerts, operators can review all symptoms of the incident in one context.
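To make the idea of correlation concrete, here is a minimal sketch that groups alerts sharing an attribute when they arrive close together in time. It is an illustration only, not BigPanda's correlation engine: the alert fields (`timestamp`, `service`, `check`), the `correlate` function, and the five-minute window are all assumptions made for the example.

```python
def correlate(alerts, key="service", window_seconds=300):
    """Group alerts that share `key` and arrive within `window_seconds`
    of the previous alert in the same group (hypothetical rule)."""
    incidents = []
    latest = {}  # key value -> most recent incident (a list of alerts)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        group = latest.get(alert[key])
        if group and alert["timestamp"] - group[-1]["timestamp"] <= window_seconds:
            group.append(alert)      # same outage: extend the open incident
        else:
            group = [alert]          # too old or unseen: start a new incident
            incidents.append(group)
            latest[alert[key]] = group
    return incidents


# Hypothetical alerts; timestamps are seconds since an arbitrary start.
alerts = [
    {"timestamp": 0, "service": "checkout", "check": "db_load"},
    {"timestamp": 60, "service": "checkout", "check": "app_latency"},
    {"timestamp": 90, "service": "search", "check": "disk_space"},
    {"timestamp": 4000, "service": "checkout", "check": "app_latency"},
]
incidents = correlate(alerts)
for incident in incidents:
    print([a["check"] for a in incident])
```

In this toy data the database load alert and the latency alert land in the same incident, mirroring Example #1, while the much later latency alert opens a new one.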
Method #2 – Smart Titles
BigPanda’s logic assigns a Smart Title to each correlated IT incident. It does so by identifying patterns in an alert storm and flagging the service or system that is the common denominator of all events related to a single outage. A couple of examples:
- Example #1: dozens of servers supporting the same application raise load alerts. The Smart Title in BigPanda will describe the problematic application.
- Example #2: network alerts storm in from a variety of devices connected to switches, which are then connected to other switches. The incident’s Smart Title identifies the one switch that is common to all the issues across the network.
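The “common denominator” idea can be sketched as a simple count: for each attribute, find the value shared by the most alerts and name the incident after the winner. This is a simplified illustration, not how Smart Titles are actually computed; the tag names (`application`, `switch`, `host`) and the `smart_title` function are hypothetical.

```python
from collections import Counter


def smart_title(alerts, tags=("application", "switch", "host")):
    """Name an incident after the tag value shared by the most alerts.
    Ties go to the tag listed first in `tags`."""
    best = None  # (count, tag, value)
    for tag in tags:
        counts = Counter(a[tag] for a in alerts if tag in a)
        if not counts:
            continue
        value, count = counts.most_common(1)[0]
        if best is None or count > best[0]:
            best = (count, tag, value)
    if best is None:
        return "Uncorrelated alerts"
    count, tag, value = best
    return f"{tag} {value}: {count} of {len(alerts)} alerts"


# Hypothetical load alerts from servers behind one application.
alerts = [
    {"host": "web-01", "application": "billing", "check": "load"},
    {"host": "web-02", "application": "billing", "check": "load"},
    {"host": "web-03", "application": "billing", "check": "load"},
]
title = smart_title(alerts)
print(title)
```

Every host appears only once, but all three alerts share the same application, so the title points at the application and not at any single server.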
Method #3 – Timeline
BigPanda’s Timeline feature aids incident forensics, using visualizations to show how an incident unfolded over time. NOC operators or SREs can easily identify which alerts occurred first, and how they affected other alerts over a given period.
- Example: a “low memory” alert triggers a “SWAP utilization” alert, which triggers a “high disk I/O” alert, which cascades to a “high CPU load” alert, ending in a “high system load” alert. Visual analytics make it easy to trace the incident back to the event that started it all.
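The cascade above can be approximated in a few lines by ordering alerts on their start time and printing each one's offset from the first. This is only a text-mode sketch of the idea behind a timeline view; the `started_at` field, the timestamps, and the `build_timeline` helper are assumptions for illustration. Keep in mind that the earliest alert is a candidate for root cause, not proof of it.

```python
def build_timeline(alerts):
    """Return (seconds since first alert, alert name) pairs in start order."""
    if not alerts:
        return []
    ordered = sorted(alerts, key=lambda a: a["started_at"])
    first = ordered[0]["started_at"]
    return [(a["started_at"] - first, a["name"]) for a in ordered]


# Hypothetical alerts, deliberately listed out of order.
alerts = [
    {"name": "high CPU load", "started_at": 1_000_190},
    {"name": "low memory", "started_at": 1_000_000},
    {"name": "high system load", "started_at": 1_000_240},
    {"name": "SWAP utilization", "started_at": 1_000_045},
    {"name": "high disk I/O", "started_at": 1_000_120},
]
timeline = build_timeline(alerts)
for offset, name in timeline:
    print(f"+{offset:>4}s  {name}")
```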
Method #4 – Unified Search
BigPanda’s Unified Search makes it easy to find historical occurrences of the same issue. This is particularly useful in incident reporting. Investigators can review any comments made about the incident, past or present. Service Ops teams can see which people worked on the incident, aiding collaboration across distributed NOC environments. This can considerably accelerate the enterprise’s understanding of the issue.
- Example: application X generates timeouts. A search of the application’s incident history reveals a similar incident that occurred nine months ago. That report includes a comment explaining that the event was triggered when a certain folder reached capacity. A quick check of the folder in question reveals that it is indeed full. Clean the folder and the alert is cleared.
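A search like the one in this example boils down to matching query words against incident titles and comments and returning the newest matches first. The sketch below shows that in miniature; the incident records, their field names, and the `search_incidents` function are invented for illustration and do not reflect BigPanda's search syntax or data model.

```python
import re


def search_incidents(incidents, query):
    """Return incidents whose title or comments contain every query word,
    newest first. Matching is case-insensitive and on whole words."""
    terms = set(re.findall(r"\w+", query.lower()))
    hits = []
    for incident in incidents:
        text = " ".join([incident["title"], *incident["comments"]]).lower()
        if terms <= set(re.findall(r"\w+", text)):
            hits.append(incident)
    # ISO dates sort correctly as plain strings.
    return sorted(hits, key=lambda i: i["opened"], reverse=True)


# Hypothetical incident history.
incidents = [
    {"id": "INC-101", "opened": "2023-01-10", "title": "Application X timeouts",
     "comments": ["Cause: upload folder reached capacity; cleaned it"]},
    {"id": "INC-245", "opened": "2023-10-12", "title": "Application X timeouts",
     "comments": []},
    {"id": "INC-180", "opened": "2023-06-02", "title": "Application Y login errors",
     "comments": ["Expired certificate"]},
]
results = search_incidents(incidents, "application x timeouts")
for incident in results:
    print(incident["id"], incident["opened"], incident["comments"])
```

The older hit carries the comment about the full folder, which is exactly the kind of historical note that shortens the current investigation.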
Method #5 – Script Execution
BigPanda supports automatic script execution, which allows commands to be autonomously executed in your environment when a particular alert occurs. The platform can run various checks on problematic hosts or applications even before the NOC operator sees the first alert.
- Example #1: a “host unreachable” alert automatically triggers execution of a ping command from several servers. The operator can instantly see which servers can access the unreachable host and which cannot, providing visual insight into root cause.
- Example #2: a “high load” alert from host X triggers the execution of a “list processes” command on that host. The operator can see which process is causing the heavy load, clearly identifying a suspected root cause.
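Both examples follow the same pattern: an alert name selects a diagnostic command, which runs before a human looks at the alert. A minimal sketch of that pattern follows. The `PLAYBOOK` mapping, the alert fields, and the helper names are hypothetical, and the commands assume a Linux environment with `ping`, `ssh` key access, and procps `ps`; none of this represents BigPanda's own configuration format.

```python
import subprocess

# Hypothetical mapping from alert name to a function that builds the command.
PLAYBOOK = {
    "host unreachable": lambda host: ["ping", "-c", "3", host],
    "high load": lambda host: ["ssh", host, "ps", "-eo", "pid,pcpu,comm", "--sort=-pcpu"],
}


def command_for(alert):
    """Return the diagnostic command (argv list) for an alert, or None."""
    build = PLAYBOOK.get(alert["name"])
    return build(alert["host"]) if build else None


def run_diagnostic(alert, runner=subprocess.run):
    """Run the alert's diagnostic and return its command, exit code, and output.
    `runner` can be replaced in tests so nothing is actually executed."""
    argv = command_for(alert)
    if argv is None:
        return None
    result = runner(argv, capture_output=True, text=True, timeout=30)
    return {"command": argv, "exit_code": result.returncode, "output": result.stdout}


# Only build the command here; call run_diagnostic(alert) to execute it.
alert = {"name": "host unreachable", "host": "db-01.example.internal"}
print(command_for(alert))
```

Passing the command as an argument list, with no shell involved, keeps alert fields from being interpreted as shell syntax. A production version would also validate the host name before using it.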
Method #6 – Automatic Links
BigPanda can automatically generate links to webpages that may contain useful reference information for the investigation. Surfacing the right help and support content can considerably accelerate the RCA process, because the platform pushes information to the operator instead of leaving the operator to go and find it.
- Example #1: a “high error rate” alert from AppDynamics links to a related Splunk search.
- Example #2: a “disk space” alert links to a Graphite dashboard showing the seven-day trend for that disk.
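Link generation of this kind is essentially URL templating: each alert type has a template, and fields from the alert fill in the placeholders. The sketch below illustrates that under clearly invented assumptions. The `example.com` hostnames, URL paths, query parameters, and alert fields are placeholders, not real Splunk or Graphite endpoints and not BigPanda's configuration.

```python
from urllib.parse import quote

# Hypothetical URL templates keyed by alert name.
LINK_TEMPLATES = {
    "high error rate": "https://splunk.example.com/search?q={query}",
    "disk space": "https://graphite.example.com/dashboard?host={host}&from=-7d",
}


def link_for(alert):
    """Fill the template for this alert type with URL-encoded alert fields."""
    template = LINK_TEMPLATES.get(alert["name"])
    if template is None:
        return None
    fields = {key: quote(str(value), safe="") for key, value in alert.items()}
    return template.format(**fields)


disk_link = link_for({"name": "disk space", "host": "db-01"})
error_link = link_for({"name": "high error rate", "query": "app=checkout error"})
print(disk_link)
print(error_link)
```

Encoding every field before substitution means spaces and characters such as `=` in a search query cannot break the generated URL.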
When seeking improvements in Root Cause Analysis, look for an IT event management platform that can intelligently and autonomously leverage incident data for forensics, investigation and resolution.