Autodesk uses BigPanda’s Event Enrichment Engine to accelerate their IT Ops
Today’s incident pipelines are noisy. The average enterprise deals with at least 15 different monitoring and observability tools that create thousands of alerts a day, often overwhelming its IT operations teams. But volume is not the only issue. While these alerts contain a wealth of low-level technical information, they lack operational, topological and business context – things like priority, affected apps or locations, or the most appropriate team to investigate. This makes it difficult to correlate them into meaningful incidents, determine their root cause, understand next steps, or automate incident management tasks.
The result? Prolonged or even undetected outages that inflict downtime-related misery on admins, engineers, users, and customers. To overcome this challenge, IT Ops teams need to cleanse, prepare and enrich their alert data with cross-domain operational and topological context – and to do all of this at scale.
The “at scale” part is what makes this tricky. If you only have one or two monitoring tools, or one or two IT/Dev teams, it might not seem difficult to configure those tools to add enrichment details directly to the alerts as they are created. But as you deploy new tools, create new applications, or onboard additional teams, the overhead of creating these new enrichments across every tool or team becomes taxing. This is why a single AIOps solution that collects all the alert data, centralizes enrichment, and uses those enrichments to intelligently correlate the alerts into incidents across every functional domain makes the most sense for modern enterprises.
I recently had the pleasure of hosting a webinar with Sid Roy, VP Operations and Client Support at Scicom, who provided valuable insight into data acquisition and preparation, and Samy Senthivel, Sr. Manager of Engineering-Observability at Autodesk, who discussed their centralized event enrichment process using BigPanda. You can watch the on-demand webinar here.
In this blog post, we’ll focus on the event enrichment phase – and how Autodesk uses it to help reduce their IT noise by 95% and accelerate their IT operations.
Autodesk’s incident management lifecycle
Autodesk’s incident management lifecycle consists of five stages: issue triggering, event creation (monitoring tools), event ingestion (data pipelines), AIOps (event enrichment, correlation and analysis), and action (incident response and remediation).
The first stages cover the creation of monitoring events and alerts when issues are detected in collected logs and metrics. As Samy states, Autodesk has 25 different tools that monitor their applications and infrastructure. These tools create data streams, but do not analyze or correlate them. So the Autodesk IT Ops teams often have to deal with a huge number of alerts, metrics and logs, which are then sent to three dedicated pipelines.
This is where BigPanda comes in. The alert pipeline is ingested by BigPanda and enriched with data from Autodesk’s ServiceNow CMDB, Maintenance Plans and Change Requests; their VMware vCenter; their AWS Cloud metadata; and their Dynatrace Service Maps. These enriched alerts are then correlated by BigPanda’s Open Box Machine Learning into high-level, insight-rich incidents, and passed on to the operators for triage, remediation and resolution.
How enrichment helps accelerate Autodesk’s IT operations
Autodesk utilizes BigPanda’s robust enrichment engine in many ways to enhance correlation and assist in root cause analysis.
Autodesk uses its host naming convention to extract enrichment tags that describe the alerting device, its domain, its function, its location and more. These tags provide an abundance of information both for correlating the alerts and for understanding their context during root cause analysis. For example, alerts can be correlated based on a common city tag value.
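To make the extraction idea concrete, here is a minimal sketch in plain Python. It is not BigPanda’s API and not Autodesk’s actual configuration: the naming convention (function, city, index, then domain), the tag names and the sample hosts are all assumptions made up for illustration.

```python
import re

# Hypothetical naming convention: <function>-<city>-<index>.<domain>
# (an assumption for this sketch, not Autodesk's real convention)
HOST_PATTERN = re.compile(
    r"^(?P<function>[a-z]+)-(?P<city>[a-z]+)-(?P<index>\d+)\.(?P<domain>[a-z0-9.-]+)$"
)


def extract_tags(alert):
    """Return a copy of the alert with tags parsed from its host name."""
    enriched = dict(alert)
    match = HOST_PATTERN.match(alert.get("host", ""))
    if match:
        # Each named group becomes an enrichment tag on the alert.
        enriched.update(match.groupdict())
    return enriched


def group_by_tag(alerts, tag):
    """Group alerts that share the same value for the given tag."""
    groups = {}
    for alert in alerts:
        groups.setdefault(alert.get(tag), []).append(alert)
    return groups


# Made-up sample alerts
alerts = [
    {"host": "db-sfo-01.example.com", "check": "disk_full"},
    {"host": "web-sfo-07.example.com", "check": "http_5xx"},
    {"host": "web-nyc-02.example.com", "check": "http_5xx"},
]

enriched_alerts = [extract_tags(a) for a in alerts]
by_city = group_by_tag(enriched_alerts, "city")
```

In this sketch the two "sfo" alerts land in the same group, which mirrors the idea of correlating on a common city tag. Hosts that do not follow the convention are passed through unchanged rather than rejected.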
Autodesk application and topology maps are ingested by BigPanda and used to enrich alerts through dependency mapping that is fed into enrichment tables (CSV tables that define which tag to query and the resulting value – multiple maps can be created within a single environment). This provides alerts with relevant topology context for better correlation. In one example, by detecting that all alerting apps are connected to a single VMware storage that failed, thousands of alerts are correlated into a single incident whose root cause is easily detected.
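The mapping approach can be sketched in a few lines of plain Python. This is an illustration of the general technique only, not BigPanda’s implementation: the table contents, the tag names ("application", "storage") and the sample alerts are assumptions.

```python
import csv
import io

# Hypothetical enrichment table: the query tag is "application",
# the resulting tag is "storage". Contents are made up.
MAPPING_CSV = """application,storage
billing,vmware-datastore-01
licensing,vmware-datastore-01
render-farm,vmware-datastore-02
"""


def load_mapping(csv_text, query_tag, result_tag):
    """Build a lookup from query tag value to result tag value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[query_tag]: row[result_tag] for row in reader}


def apply_mapping(alert, mapping, query_tag, result_tag):
    """Return a copy of the alert with the result tag added when a row matches."""
    enriched = dict(alert)
    value = mapping.get(alert.get(query_tag))
    if value is not None:
        enriched[result_tag] = value
    return enriched


storage_by_app = load_mapping(MAPPING_CSV, "application", "storage")

app_alerts = [
    {"application": "billing", "check": "io_timeout"},
    {"application": "licensing", "check": "io_timeout"},
    {"application": "render-farm", "check": "queue_depth"},
]

mapped_alerts = [
    apply_mapping(a, storage_by_app, "application", "storage") for a in app_alerts
]

# Alerts that share a storage dependency fall into the same candidate incident.
incidents = {}
for alert in mapped_alerts:
    incidents.setdefault(alert.get("storage"), []).append(alert)
```

Here the billing and licensing alerts end up together because both map to the same datastore, which is the kind of shared dependency that points to a single root cause.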
Composition enrichment combines several tag values to create a new tag value – such as a runbook URL created by combining a wiki base URL with the cluster type tag value and the check tag value. Autodesk uses the composition capability, among other things, to identify the team responsible for the alert by adding a CI type tag to the source tag. This allows an automated notification to be triggered through PagerDuty and a relevant ticket to be opened in ServiceNow.
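A rough sketch of composition in plain Python follows. It is not BigPanda’s syntax: the wiki base URL, the tag names, the "routing_key" tag and the "-" separator are all hypothetical choices made for this example.

```python
# Hypothetical wiki base URL, used only for illustration
WIKI_BASE_URL = "https://wiki.example.com/runbooks"


def compose(alert, new_tag, template):
    """Return a copy of the alert with new_tag built from existing tag values.

    The template names the tags it needs in braces. If any of them is
    missing from the alert, the new tag is simply not added.
    """
    enriched = dict(alert)
    try:
        enriched[new_tag] = template.format(**alert)
    except KeyError:
        pass
    return enriched


# Made-up sample alert
alert = {
    "source": "nagios",
    "ci_type": "database",
    "cluster_type": "kafka",
    "check": "disk_full",
}

# Runbook URL = wiki base URL + cluster type + check
with_runbook = compose(alert, "runbook_url", WIKI_BASE_URL + "/{cluster_type}/{check}")

# Routing key = source + CI type, which a downstream rule could map to a team
with_routing = compose(with_runbook, "routing_key", "{source}-{ci_type}")
```

A composed tag like the routing key in this sketch is what a notification or ticketing rule could match on to reach the right team.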
These enrichment examples and others help Autodesk correlate alerts effectively and reduce IT noise by 95%!
As the examples above show, BigPanda’s Event Enrichment Engine allows you to grow your enrichment capabilities gradually. Start small, then slowly add data sources and enrichment methods to your alerts. As your correlations improve and your root cause analysis becomes more accurate (you’ll see this happen in our Unified Analytics reports), you’ll be able to substantially shorten your MTTR and improve your IT operations across the board. To learn more, visit our Event Enrichment Engine page on our website.