Something happened to enterprise IT starting in the early 90s. It stopped being a ‘nice to have’ run by Office Space types, and started to become the foundation for systems, applications and services used by different business units, departments and teams – touching every part of every organization, everywhere.
Because of that, today, nearly thirty years later, it’s hard to imagine any modern organization functioning normally in the face of outages or problems affecting its IT stack.
If you’re a law firm, email outages could cripple you because much of your client communication flows through your email system. If you’re a bank that launched a mobile check deposit feature to great fanfare, imagine your customers’ reactions when the app doesn’t let them deposit any checks. Or if you run a massive gaming platform, imagine the howls of outrage when the latest installment of your hit-game franchise doesn’t let users log on…I could probably give you a hundred examples, and you could probably come up with a hundred more.
That’s why IT outages are the bane of IT operations teams everywhere, and today’s highly sophisticated and dynamic IT stacks that power modern applications aren’t helping.
But what is helping, at last, is AI in IT operations.
Over the last five years, AI in IT operations, aka AIOps, has started to dramatically slash the frequency, duration and impact of major IT incidents and outages. It is transforming the role of IT operations inside enterprises.
Thanks to AIOps, IT operations need not just play a ‘behind the scenes’ role supporting uptime; instead IT operations can help the organization reliably serve customers, cement loyalty and protect revenue, and confidently accelerate strategic initiatives such as cloud migration and app modernization.
If you are an IT exec, IT Ops leader, or tooling architect who has already embraced AIOps and your teams are enjoying the benefits, I congratulate you. You have achieved what many of your peers still dream of.
But if you’re still researching AIOps, or you’re about to embark on an AIOps project, this is the perfect time to learn about the five critical functions of AIOps.
Here is our take on the five functions that Gartner, the global research and advisory firm, wrote about in their latest 2021 Market Guide for AIOps Platforms.
Different IT vendors excel in different areas or domains. Savvy IT architects and IT leaders know this, and have probably architected their IT stack to be highly heterogeneous, and connected together via APIs.
For IT operations, this means that valuable IT data, alerts and events must be sourced from different tools that cover different technology domains (such as virtualization, servers, Java-based apps, AWS-based systems, etc.).
Therefore, AIOps solutions must be able to easily ingest data from multiple sources, vendors and technology domains, ideally using a combination of out-of-the-box connectors and flexible APIs.
Data types can vary, but you should prioritize ingestion of alerts (for actionable triggers), as well as change and topology data (for root cause analysis; more on this later).
But ingestion is only half the story. What do you do with the data once you ingest it?
Because outages must (ideally) be caught in real time, your system must be able to process the ingested data streams in real time, normalizing and enriching this data with operational context as soon as it’s collected.
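To make "normalizing and enriching" concrete, here is a minimal Python sketch. The two payload shapes (`tool_a`, `tool_b`), their field names, and the host-to-service lookup are all hypothetical, invented for illustration; real monitoring tools will have their own formats.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class Alert:
    """Common schema that every ingested alert is normalized into."""
    source: str
    host: str
    check: str
    severity: str            # "critical", "warning" or "info"
    timestamp: datetime
    service: str = "unknown"  # filled in later by enrichment


def from_tool_a(payload: dict) -> Alert:
    # Hypothetical tool A: flat payload, epoch timestamps, CRIT/WARN states.
    severity = {"CRIT": "critical", "WARN": "warning"}.get(payload["state"], "info")
    return Alert(
        source="tool_a",
        host=payload["hostname"].lower(),
        check=payload["check_name"],
        severity=severity,
        timestamp=datetime.fromtimestamp(payload["epoch"], tz=timezone.utc),
    )


def from_tool_b(payload: dict) -> Alert:
    # Hypothetical tool B: nested payload, ISO 8601 timestamps, numeric levels.
    severity = {5: "critical", 3: "warning"}.get(payload["level"], "info")
    return Alert(
        source="tool_b",
        host=payload["resource"]["name"].lower(),
        check=payload["alarm"],
        severity=severity,
        timestamp=datetime.fromisoformat(payload["time"]),
    )


def enrich(alert: Alert, host_to_service: dict) -> Alert:
    # Attach operational context: which business service runs on this host.
    return replace(alert, service=host_to_service.get(alert.host, "unknown"))


a = from_tool_a({"hostname": "WEB-01", "check_name": "cpu", "state": "CRIT", "epoch": 0})
b = from_tool_b({"resource": {"name": "web-01"}, "alarm": "latency",
                 "level": 3, "time": "2021-06-01T12:00:00+00:00"})
enriched = enrich(a, {"web-01": "checkout"})
```

Once both alerts share one schema, everything downstream (correlation, prioritization, reporting) can ignore which tool an alert came from.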
At the same time, parsing historical operations data for useful trends, and understanding how KPIs and metrics change over time, is super valuable for measuring, tracking and improving different aspects of IT operations. So, in addition to real-time processing after ingestion, your AIOps platform must offer historical analytics and reporting capabilities.
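As one small example of historical analytics, the sketch below computes mean time to remediation per calendar month. The input format (pairs of detection and resolution timestamps) is an assumption made for illustration, not any platform's actual export format.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mttr_by_month(incidents):
    """incidents: iterable of (detected_at, resolved_at) datetime pairs.

    Returns {"YYYY-MM": mean minutes to remediation}, ordered by month.
    """
    buckets = defaultdict(list)
    for detected, resolved in incidents:
        minutes = (resolved - detected).total_seconds() / 60
        buckets[detected.strftime("%Y-%m")].append(minutes)
    return {month: mean(durations) for month, durations in sorted(buckets.items())}


history = [
    (datetime(2021, 1, 5, 9, 0), datetime(2021, 1, 5, 9, 30)),    # 30 minutes
    (datetime(2021, 1, 20, 14, 0), datetime(2021, 1, 20, 15, 0)),  # 60 minutes
    (datetime(2021, 2, 3, 8, 0), datetime(2021, 2, 3, 8, 20)),     # 20 minutes
]
trend = mttr_by_month(history)
```

A falling number month over month is the kind of trend this reporting is meant to surface.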
When applications were designed with the classic three-tier architecture that changed every few weeks or months (when new servers were added to a cluster, for example), understanding their topology – and tracking changes to the topology over time – was easy.
A server outage invariably meant that the application hosted on that server was also impacted. Life for IT operations teams responsible for detecting and mitigating the outage was simpler (though not simple).
In today’s complex and dynamic IT stacks, app and service topologies are not only fragmented across several systems (orchestration, configuration management, CMDBs and others), but they’re also constantly changing. Understanding the relationships between IT assets, applications and services has never been more complex.
IT operations teams trying to find the root cause of a problem, or the downstream implications of an outage, have never had it harder.
So your AIOps platform must be able to dip into the sources of topology in modern environments and stitch together a real-time, up-to-date topology model.
This model makes it easy to understand the dependencies and connections between different servers, network devices, cloud-based resources and other IT assets. Now, when there’s an outage, IT operations teams can identify the root cause more easily, and shorten downtime.
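A rough way to picture this stitching is a dependency graph merged from several sources. The sketch below assumes each source can be reduced to a list of (upstream, downstream) pairs; the source names and asset names are made up for the example.

```python
from collections import defaultdict, deque


def build_topology(*edge_sources):
    """Merge (upstream, downstream) dependency edges from several sources."""
    graph = defaultdict(set)
    for edges in edge_sources:
        for upstream, downstream in edges:
            graph[upstream].add(downstream)
    return graph


def downstream_impact(graph, failed):
    """Breadth-first walk: everything that directly or indirectly depends on `failed`."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {failed}


# Hypothetical edges, as they might come from a CMDB and an orchestrator.
cmdb_edges = [("switch-1", "server-a"), ("switch-1", "server-b")]
orchestrator_edges = [("server-a", "checkout-api"), ("server-b", "search-api")]

topology = build_topology(cmdb_edges, orchestrator_edges)
impacted = downstream_impact(topology, "switch-1")
```

Neither source alone shows that a failing switch can take down the checkout API; the merged graph does.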
Once data from different tools and sources is ingested, and a real-time topology model has been assembled, the real work begins.
Your AIOps system must be able to correlate all of your alerts and events, which, for the typical enterprise, can easily be tens or hundreds of thousands of daily alerts, against your unified topology model.
There are two major implications of event correlation.
- Event correlation reduces or “compresses” events by as much as 95%. This means that your IT operations teams are no longer drowning in data or chasing false positives, or simply ignoring the valuable alerts generated by expensive observability and monitoring tools.
- With event correlation, all the alerts related to a single outage are grouped together (or correlated) into a single incident. Since a single outage commonly triggers dozens or even hundreds of alerts across a dozen or more tools, your IT operations teams can focus on troubleshooting one incident instead of wasting time on 60 seemingly different yet related events.
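The two points above can be illustrated with a deliberately simple correlation rule: an alert joins an open incident if it arrives within a time window and its host is the same as, or directly connected to, a host already in that incident. This is only a sketch of the idea; the rule, the 300-second window and the data are assumptions, and production correlation logic is far richer.

```python
def correlate(alerts, neighbors, window_s=300):
    """Group alerts into incidents by time proximity and topology adjacency.

    alerts: list of (timestamp_seconds, host) tuples.
    neighbors: dict mapping host -> set of directly connected hosts
               (assumed symmetric for this sketch).
    """
    incidents = []
    for ts, host in sorted(alerts):
        related = {host} | neighbors.get(host, set())
        for incident in incidents:
            if ts - incident["last_seen"] <= window_s and incident["hosts"] & related:
                incident["hosts"].add(host)
                incident["last_seen"] = ts
                incident["alerts"].append((ts, host))
                break
        else:
            # No open incident matched, so this alert starts a new one.
            incidents.append({"hosts": {host}, "last_seen": ts, "alerts": [(ts, host)]})
    return incidents


def compression(alert_count, incident_count):
    """Fraction of items the operator no longer has to look at individually."""
    return 1 - incident_count / alert_count


neighbors = {
    "switch-1": {"server-a", "server-b"},
    "server-a": {"switch-1"},
    "server-b": {"switch-1"},
}
alerts = [(0, "switch-1"), (10, "server-a"), (20, "server-b"),
          (30, "server-a"), (4000, "db-9")]
incidents = correlate(alerts, neighbors)
```

Here five alerts collapse into two incidents: one for the switch and the servers behind it, and one for the unrelated database alert that arrived much later.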
Advanced AIOps solutions will also correlate change data, assuming it was ingested earlier.
As every IT operations team is all too aware, DevOps-led application modernization projects and the embrace of DevOps and SRE models result in hundreds or thousands of application, service and environment changes every day.
Since any one of these changes can result in an unintended outage or incident, surfacing the change that caused an outage or incident, in real time, can be invaluable.
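One simple heuristic for surfacing candidate changes is to list those applied to an affected asset shortly before the incident began, most recent first. The sketch below assumes change records with hypothetical `id`, `target` and `applied_at` fields; closeness in time only suggests a suspect and does not prove causation.

```python
def suspect_changes(incident_start, affected, changes, lookback_s=3600):
    """Return changes applied to affected assets within the lookback window,
    ordered from most recent to oldest.

    incident_start: incident start time, in seconds.
    affected: set of asset names involved in the incident.
    changes: list of dicts with "id", "target" and "applied_at" (seconds).
    """
    candidates = [
        change for change in changes
        if change["target"] in affected
        and 0 <= incident_start - change["applied_at"] <= lookback_s
    ]
    return sorted(candidates, key=lambda change: change["applied_at"], reverse=True)


changes = [
    {"id": "CHG-1", "target": "checkout-api", "applied_at": 900},
    {"id": "CHG-2", "target": "checkout-api", "applied_at": -5000},  # too old
    {"id": "CHG-3", "target": "search-api", "applied_at": 950},      # not affected
    {"id": "CHG-4", "target": "server-a", "applied_at": 400},
]
suspects = suspect_changes(1000, {"checkout-api", "server-a"}, changes)
```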
What should your AIOps solution do once it ingests and correlates events? It should recognize or detect incidents in real-time before they escalate into crippling outages, and continually learn and refine its correlation logic in order to detect future incidents more efficiently.
The ability to define and then refine correlation logic through user actions enables better outcomes, because the system internalizes your organization’s “tribal knowledge” to identify and process incidents more efficiently.
This means that your team will be able to catch a P2 or P3 incident as it forms.
Another aspect of recognizing which incidents are critical, and which are more critical than others, is business priority. For example, an incident affecting a VIP customer should be treated more urgently than one affecting a non-VIP customer. Likewise, an incident affecting the highest revenue-generating region should be handled with more urgency than one affecting the lowest revenue-generating region. But the incoming event stream doesn’t contain this prioritization matrix. Generally, such information is stored and updated in third-party systems. So your AIOps solution must be able to infuse this business context into the incidents that it detects, so your teams can recognize and appropriately handle incidents with different levels of business severity and priority.
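As an illustration, the sketch below orders incidents by technical severity and then uses a business tier, looked up from an external source, to break ties. The tier scheme (1 is most important), the service names and the ordering policy are all assumptions; your organization may well want business tier to outrank severity.

```python
SEVERITY_RANK = {"critical": 3, "warning": 2, "info": 1}


def prioritize(incidents, business_context):
    """Sort incidents: highest severity first, then most important business tier.

    business_context: dict mapping service -> {"tier": int}, where 1 is the
    most important tier. Services with no entry sort last within a severity.
    """
    def sort_key(incident):
        tier = business_context.get(incident["service"], {}).get("tier", 99)
        return (-SEVERITY_RANK.get(incident["severity"], 0), tier)

    return sorted(incidents, key=sort_key)


# Hypothetical business context, as it might be pulled from a third-party system.
context = {"vip-portal": {"tier": 1}, "internal-wiki": {"tier": 3}}
open_incidents = [
    {"id": "INC-1", "service": "internal-wiki", "severity": "critical"},
    {"id": "INC-2", "service": "vip-portal", "severity": "critical"},
    {"id": "INC-3", "service": "vip-portal", "severity": "warning"},
]
queue = prioritize(open_incidents, context)
```

The two critical incidents look identical in the raw event stream; only the business context puts the VIP portal ahead of the internal wiki.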
Furthermore, correlation logic should improve by learning from users’ interactions and actions taken on the incidents within the platform. For example, if users continually merge certain types of alerts into existing incidents or repeatedly split existing incidents, the system should learn from these actions to improve. When your AIOps platform is exposed to new tools or datasets used across your organization, it should be able to recognize patterns in the data and recommend newer, more efficient correlation logic.
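A very rough stand-in for this kind of learning is a frequency count over manual merges: if operators keep merging the same type of alert into the same type of incident, suggest a rule for it. The log format and the threshold of three merges are assumptions, and a real platform would use more sophisticated techniques than counting.

```python
from collections import Counter


def suggest_rules(merge_log, min_merges=3):
    """Suggest correlation rules from repeated manual merges.

    merge_log: iterable of (alert_type, incident_type) pairs, one per manual merge.
    Returns the pairs merged at least `min_merges` times, in sorted order.
    """
    counts = Counter(merge_log)
    return sorted(pair for pair, merged in counts.items() if merged >= min_merges)


merge_log = [("disk_latency", "db_outage")] * 3 + [("cpu_high", "db_outage")]
suggestions = suggest_rules(merge_log)
```

The output is a recommendation for a human to review, not a rule that is applied automatically.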
Remediation is the critical last step in the incident management lifecycle.
Every IT operations team, once it detects an incident and uncovers the root cause, wants to remediate it, restore service and…move on to the next outage or incident!
AIOps platforms must automate responses to incidents and trigger external automations with collaboration or ITSM tools to facilitate the rapid remediation of your incidents:
- Automate manual tasks: IT operations teams spend valuable time on various manual tasks that delay remediation. They could waste minutes during the incident triage phase looking up runbook information locked away in external sources such as spreadsheets. Or, they could waste precious seconds snoozing a specific type of incident every time it occurs. There are many, many such examples of course. Your AIOps system must give you the ability to automate these actions to cut a minute here, a few seconds there, and so on, so you can meaningfully reduce your MTTR (mean time to remediation).
- Automate incident sharing: Many organizations use collaboration tools to share incident information with their service desk teams and their L3/DevOps teams in order to remediate incidents and restore service. But because the process of sharing this information is still very manual, it delays remediation and lengthens MTTR. For example, if it takes 8 minutes for an IT operations team to manually create a ticket in ServiceNow and enter all the relevant information once an incident is detected, and if this team must do this 20 times an hour, around the clock, they’ve just delayed remediation across all of their incidents, cumulatively, by 23,360 hours every year (8 minutes × 20 tickets × 8,760 hours in a year = 1,401,600 minutes). To flip it, with simple automated ticket creation, this organization can accelerate incident remediation by 23,360 hours every year! You should therefore automate incident sharing with ticketing and service desk systems, and on-call notification and paging systems; you should also automate the creation of war-rooms in messaging and chat systems. In addition to boosting your teams’ productivity by eliminating mundane manual tasks, this ensures that the right experts in your organization are notified when there’s a problem they need to investigate and resolve. This accelerates the resolution of such problems.
- Invoke an external automation system: Consider an application experiencing a P2 incident because the /var/tmp partition on a server is full. Remedying this issue requires an operator to ssh into the server and clear out this partition, which eats up 5 minutes every single time he or she takes this action. If this were a recurring issue, wouldn’t it accelerate remediation if the AIOps platform could invoke an external automation system to take this action every time this problem is detected? The answer, of course, is yes! As long as you have a vetted set of remediation actions that you’re comfortable automating, your solution must allow you to invoke third-party automation systems such as Ansible, StackStorm, Rundeck and others to execute those actions. Note here that you must have the flexibility to integrate with the automation system of your choice – whether it’s a commercial offering or an open-source project.
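The /var/tmp example above can be sketched as an allowlist that maps incident signatures to vetted actions. The signature string and the playbook path below are hypothetical; `ansible-playbook` with `--limit` is used as one possible automation backend, and the function only builds the command rather than running it.

```python
# Allowlist of vetted remediations: incident signature -> playbook path.
# Both the signature format and the playbook name are hypothetical.
VETTED_ACTIONS = {
    "disk_full:/var/tmp": "playbooks/clean_var_tmp.yml",
}


def build_remediation_command(signature, host):
    """Return the command for a vetted remediation, or None if none is vetted.

    Returning None means the incident should be left for a human operator.
    """
    playbook = VETTED_ACTIONS.get(signature)
    if playbook is None:
        return None
    # --limit restricts the playbook run to the affected host only.
    return ["ansible-playbook", playbook, "--limit", host]


command = build_remediation_command("disk_full:/var/tmp", "app-server-12")
unvetted = build_remediation_command("memory_leak:java", "app-server-12")
```

Keeping the mapping explicit means only actions you have reviewed can ever be triggered automatically; anything else falls through to a person. Swapping in StackStorm, Rundeck or another system would change only the command that is built.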
After decades of promise, AI is changing how enterprises operate, and one of the areas where the impact of AI has been nothing short of transformational is IT operations. But many organizations and IT leaders are just starting on their AIOps journeys, so it’s important that they choose wisely, at the outset. I hope that you, your tooling managers and architects find this short writeup on the five central functions of AIOps platforms to be useful and actionable.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
GARTNER is a registered trademark and service mark of Gartner Inc. and/or its affiliates in the U.S. and internationally, and is used herein with permission. All rights reserved.