Tips for Modern NOCs – Correlating Incidents to the IT Changes that Caused Them
Every NOC engineer will tell you that the first thing they look for in an outage is “what changed?”. And they are right to look. While every organization is unique, Gartner reports that on average about 80% of IT incidents today are caused by changes in infrastructure and/or software. Yet, for various reasons, the tools and processes of operations and incident management are often poorly integrated (or not integrated at all) with those of change management, so the ability to correlate incidents to the changes that caused them is limited. If you find yourself sifting through dozens of records across a multitude of screens to discover whether a change might have caused an incident, this blog is for you.
Change is Still Not Part of the Incident Management Lifecycle, and It Hurts
For many years ITIL was (and generally still is) the de facto framework for standardizing the selection, planning, delivery and maintenance of IT services within a business. Although ITIL prescribes a unified approach, in practice many organizations address Operations Management (the monitoring and control of the IT infrastructure and services) separately from Incident Management (the process of fixing a service degradation / outage), and find it difficult to integrate these processes with Problem Management (finding the underlying reason for a recurring incident to make sure it does not return). L1 engineers watch monitoring consoles, then create incident tickets which are handled by L2/L3 engineers. There is another team to address incident trends as a “problem”. And yet another team to implement corrections via Change Management. Many vendors provide integrated toolsets for these functions, but the practicalities of collaboration and technical integration make it difficult to achieve a homogeneous solution.
To summarize, evaluating change as a possible cause is difficult and rarely done in the early stages of an incident (i.e., during alert triage). Even with integrated toolsets, the separate teams tend to use the separate modules (monitoring, incident, change and problem management) as separate tools with separate processes. Only after an incident appears in our Incident or Problem Management console do we try to find out whether a related change, somewhere else in the organization, could have caused it. And this is only possible if the change was planned and/or tracked; if it was unplanned and untracked, we are in for a lengthier, more difficult root cause analysis. All the while, the L1 engineers watching those monitoring alerts are unaware that a change has caused an incident.
The Solution: Notification Automation and Integration
Automate notifications through code: Unsurprisingly, an important part of the solution can be found in the DevOps handbook. If you get the chance to read through it, you’ll see that an important part of DevOps is automation and “everything as code”. And the same is true here: make sure your change notifications are automatically part of your code. For instance, when you script a patch rollout, part of the code should take care of inserting the relevant change notifications into your change and incident sheet, notifying that the script has run. When using an orchestrator, make sure you build your systems and your scripts to do this as well. Don’t leave the responsibility for change notifications to the engineer. Do this for every automation, even a simple server reboot, and you’ll be able to capture the smallest of ad-hoc changes that you never thought would interest anyone.
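As a minimal sketch of this idea, here is what a patch-rollout script might look like with the change notification built in. The endpoint URL, field names, and helper functions are all hypothetical, assumed for illustration; in practice you would post to whatever change feed or API your change-tracking tool exposes.

```python
import datetime
import json
import urllib.request

# Hypothetical change-feed endpoint; substitute your change-tracking tool's API.
CHANGE_FEED_URL = "https://changes.example.com/api/notifications"

def build_change_notification(action, target, ticket=None):
    """Assemble a change record with the context an incident responder needs."""
    return {
        "action": action,                    # what changed
        "target": target,                    # where it changed
        "ticket": ticket,                    # planned-change reference, if any
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "source": "patch-rollout-script",    # which automation made the change
    }

def notify_change(record):
    """Post the record to the change feed as part of the script itself."""
    req = urllib.request.Request(
        CHANGE_FEED_URL,
        data=json.dumps(record).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def rollout_patch(host):
    # ... apply the patch here (ssh, Ansible, your orchestrator of choice) ...
    # The notification is not left to the engineer; the script always sends it.
    notify_change(build_change_notification("security-patch", host, ticket="CHG-1234"))
```

The design point is that `notify_change` runs unconditionally as the last step of the automation, so even an unplanned, ad-hoc run still leaves a tracked change behind.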
Integrate these changes into your event management system: I often hear prospects tell us “You don’t get it, we’re trying to deal with less information, we don’t want to see all these changes. They’re just more noise”. But the key is to not simply inject the changes into your event management as a random information stream, but rather to present them in the context of your incidents – timewise, from a domain aspect, in relation to topology and geography – or any other context relevant to your organization. In essence, to apply the same enrichment attributes that are used in the alert data to your changes. This allows you to view all the changes in one place, filtered in context to the incident you are working on, helping you answer the “what changed?” question faster and more easily.
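The filtering described above can be sketched in a few lines. This is an illustrative toy, not any product's actual correlation logic: the field names (`started`, `tags`, `timestamp`) and the two-hour window are assumptions, and the "enrichment attributes" are modeled simply as shared tags.

```python
from datetime import datetime, timedelta

def changes_in_context(incident, changes, window=timedelta(hours=2)):
    """Keep only changes that share context with the incident:
    a timestamp shortly before the incident started, plus at least
    one enrichment tag (service, datacenter, ...) in common."""
    relevant = []
    for change in changes:
        close_in_time = (
            incident["started"] - window <= change["timestamp"] <= incident["started"]
        )
        shared_tags = set(change["tags"]) & set(incident["tags"])
        if close_in_time and shared_tags:
            relevant.append(change)
    return relevant

incident = {
    "started": datetime(2023, 5, 1, 12, 0),
    "tags": {"service:checkout", "dc:us-east"},
}
changes = [
    {"summary": "checkout deploy",
     "timestamp": datetime(2023, 5, 1, 11, 30), "tags": {"service:checkout"}},
    {"summary": "dns update",
     "timestamp": datetime(2023, 4, 30, 9, 0), "tags": {"service:dns"}},
]
# Only the checkout deploy is both recent and in the same domain.
print(changes_in_context(incident, changes))
```

Even this crude filter turns a firehose of change records into a short, incident-scoped list, which is the difference between "more noise" and answering "what changed?".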
Root Cause Changes with BigPanda
BigPanda’s Root Cause Changes feature aggregates change data from all your change feeds and tools, including CI/CD, Change Management and Auditing. It then uses Open-Box Machine Learning technology to identify the changes that likely caused the incident. With Root Cause Changes:
- Change data from all your change feeds is presented in one screen.
- This data is filtered, correlated and presented in the context of the incidents you are working on.
- BigPanda surfaces and suggests the changes that are likely to have caused the incident, constantly learning and improving based on your responses.
Take a look at this short video and see Root Cause Changes in action: