The Definitive Guide to Event Correlation in AIOps: Processes, Tools, Examples, and Checklist
Over the past 20 years, I have experienced a brittle IT landscape. The pace of change with infrastructure and modern application development introduced a constant challenge for IT operations. Learn how event correlation improves IT operations, the role of AIOps, and my expert advice for choosing the right platform.
- Event correlation use cases
- Steps in event correlation
- Free event correlation tool checklist
- Proactive AI event correlation
What is event correlation?
Event correlation automates the process of analyzing monitoring alerts from networks, hardware, and applications to detect incidents and problems. Using an event correlation tool makes management of enterprise systems, applications and services easier and improves their performance and availability.
In computing, generally, an event is an occurrence or action initiated either by the system or a user. The act could be as simple as a mouse click or a website page loading. Event correlation focuses on events in which the result is not normal and signifies a problem. Historically, events associated with abnormal situations were sometimes referred to as “.”
Event correlation software ingests monitoring alerts, alarms and other event signals, detects meaningful patterns amid the deluge of information and identifies incidents and outages. The software speeds up problem resolution and improves stability and uptime for the system, application or service.
Advances in artificial intelligence, including machine learning, have strengthened event correlation. AI enables platforms to continuously improve correlation algorithms using the data they ingest and user input or user actions. This innovation, part of a trend called Artificial Intelligence for IT Operations (AIOps), makes the analysis of event data, the detection of problems and the surfacing of their root cause more efficient.
Event correlation in integrated service management
Event correlation plays a role in integrated service management, a popular way of managing IT operations as a service using a set of standardized methods. Integrated service management is a lean version of ITIL. That acronym is the current name for a comprehensive set of IT management best practices that began in the late 1980s as the Information Technology Infrastructure Library.
Within integrated service management, there are six key processes: service level management, change management, operations management, incident management, configuration management, and quality management. Event correlation falls under incident management but relates to virtually all six processes.
System monitoring produces data about events that occur. Making sense of this information stream grows harder as an enterprise’s IT systems become more complex because the volume of event data grows. Challenges come from:
- Changing topology (meaning the arrangement of nodes, devices, and connections, their relationships to each other and their interdependencies)
- Combining cloud-based and on-premises software and computing resources
- Practicing things such as decentralization, virtualized computing, and processing of increasing data volumes
- Adding, removing, or updating applications as well as integrating them with legacy systems
IT operations staff, DevOps teams, and network operations center (NOC) managers cannot keep up with the volume of alerts and detect incidents and outages in time, before they affect revenue-generating applications and services or other critical back-end systems. These factors raise the risk that incidents and outages will hurt the company’s business.
Event correlation software tackles this challenge by collecting monitoring data from across the managed environment and using AI to consolidate those monitoring alerts into clusters related to the same issue. As part of that process, the event correlation platform compares these clusters with up-to-date system data on changes and network topology. The software uses this information to identify the causes of issues and their solutions much faster and more thoroughly than human technicians could.
Steps in event correlation
Event correlation is a multi-step process that begins with aggregating event data. Next, that data proceeds through filtering, deduplication, normalization, enrichment, and then correlation. Finally, the software recognizes which incidents are instances of the same problem.
This action could trigger further investigation or corrective measures. Here are the event correlation steps in detail:
- Event Aggregation: This process encompasses gathering monitoring data from different monitoring tools into a single location. Enterprises integrate various sources into the solution, so all data is easily accessible on an as-needed basis.
- Event Filtering: Many solutions filter the data before any processing. This step can be done before event aggregation, but it is more commonly performed within the solution after aggregation.
- Event Deduplication or Deduping: This step identifies events that are occurrences of the same issue. Say 1,000 users encountered a particular error message over two hours. This process would generate 1,000 alerts, but there aren’t 1,000 problems. In reality, there were 1,000 instances of one issue. Similarly, if a monitoring tool generates an alert about a problem dozens of times, there is just one problem despite the dozens of notifications. Deduping makes this clear. Duplication can occur for many reasons. For example, a disk drive may be full. The monitoring tool monitoring that drive may generate hundreds or even thousands of alerts about this issue, over time, until the problem is resolved.
- Event Normalization: The normalization step ensures monitoring data collected from many different sources is put into a single consistent format that then allows the solution to use AI to correlate this data. For example, one monitoring tool calls something a “host” and another calls it a “server.” Normalization may use “Affected Device” to refer to the contents of the “host” and “server” fields so that, regardless of the monitoring tool sending the data, the solution interprets it the same way.
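The normalization and deduplication steps above can be illustrated with a minimal sketch. It is not any vendor's implementation: the field names (`host`, `server`, `hostname`, `check`, `status`) and the `NormalizedEvent` schema are assumptions made for illustration. The sketch also normalizes before deduplicating so that alerts from different tools can be compared, which is a simplification of the order listed above.

```python
from dataclasses import dataclass

# Hypothetical field names: each monitoring tool may label the affected machine differently.
DEVICE_FIELD_ALIASES = ("host", "server", "hostname")


@dataclass(frozen=True)
class NormalizedEvent:
    """One consistent schema for events, regardless of the sending tool."""
    affected_device: str
    check: str
    status: str


def normalize(raw: dict) -> NormalizedEvent:
    """Map a tool-specific payload onto the shared schema."""
    device = next((raw[key] for key in DEVICE_FIELD_ALIASES if key in raw), "unknown")
    return NormalizedEvent(
        affected_device=str(device).lower(),
        check=str(raw.get("check", "unknown")).lower(),
        status=str(raw.get("status", "unknown")).lower(),
    )


def deduplicate(events) -> dict:
    """Collapse repeated occurrences of the same issue into one entry with a count."""
    counts = {}
    for event in events:
        counts[event] = counts.get(event, 0) + 1
    return counts


raw_alerts = [
    {"host": "db-01", "check": "disk_full", "status": "critical"},
    {"server": "DB-01", "check": "disk_full", "status": "CRITICAL"},
    {"host": "web-02", "check": "cpu_load", "status": "warning"},
]
unique_events = deduplicate(normalize(alert) for alert in raw_alerts)
for event, count in unique_events.items():
    print(event.affected_device, event.check, event.status, f"x{count}")
```

Here the two disk alerts for the same machine arrive under different field names and letter cases, yet they collapse into a single issue once they share a schema.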
Once these steps are complete, the tool identifies the source of the issue and then triggers a response by performing the following two actions:
- Root Cause Analysis: Next, the event correlation software analyzes the data to determine the underlying cause by looking for relationships and patterns among events. AI-driven machine learning accelerates and automates this analysis. The system compares the event information with log information on changes occurring in IT architecture, configuration, and software. (Gartner analysts estimate that 85 percent of incidents are caused by changes.) For example, this may reveal that a particular issue began occurring after an update to an application.
- Action Triggering: This finding triggers a follow-up action, such as further investigation of the application code change, automated remedial action, or escalation for implementation of a solution.
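As a rough illustration of comparing event information with change data, the sketch below lists changes made to the affected components shortly before an incident began. The record layout (`id`, `target`, `at`), the two-hour lookback, and the function name `suspect_changes` are all hypothetical; real platforms use far richer signals than a time window.

```python
from datetime import datetime, timedelta


def suspect_changes(incident_start, affected, change_log, lookback=timedelta(hours=2)):
    """Return changes to affected components made shortly before the incident, newest first."""
    window_start = incident_start - lookback
    candidates = [
        change
        for change in change_log
        if change["target"] in affected
        and window_start <= change["at"] <= incident_start
    ]
    return sorted(candidates, key=lambda change: change["at"], reverse=True)


# Hypothetical change log entries.
change_log = [
    {"id": "CHG-1", "target": "checkout-app", "at": datetime(2024, 1, 5, 9, 40)},
    {"id": "CHG-2", "target": "search-app", "at": datetime(2024, 1, 5, 9, 50)},
    {"id": "CHG-3", "target": "checkout-app", "at": datetime(2024, 1, 4, 9, 0)},
]

suspects = suspect_changes(
    incident_start=datetime(2024, 1, 5, 10, 0),
    affected={"checkout-app"},
    change_log=change_log,
)
print([change["id"] for change in suspects])
```

A result like this is only a list of candidates for follow-up, such as investigating the application update, not proof of causation.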
Event types in event correlation
Businesses correlate different types of events based on their IT environments and needs. However, there are several common types, such as events in operating systems, data storage and web servers.
- System Events: These are events that describe unusual states or changes in computing system resources and health, such as high CPU load or disk full.
- Operating System Events: These events are generated by operating systems, such as Windows, Unix, and Linux, as well as embedded operating systems, including Android and iOS. These operating systems are an interface between hardware and application software.
- Application Events: These events arise from software applications and include transactions such as e-commerce purchases or entry of visit notes by a healthcare provider. Events may occur in business activity monitoring software, which presents and analyzes real-time data on important company processes. This data is then fed to the event correlation tool.
- Database Events: These are events that occur in the reading, updating and storing of data in databases.
- Web Server Events: Events in the hardware and software that delivers content to web pages.
- Network Events: Events at the network level involving devices such as routers and switches. Examples include changes in the health of network ports, switches, or routers, and events generated when network traffic goes over or under certain thresholds.
- Other Events: Other types of events include synthetic checks, or probes, that check functionality from the outside in, as well as real-user monitoring and client telemetry, which generate specific events as users interact with the service.
Event correlation KPIs
The primary key performance indicator (KPI) in event correlation is compression. Expressed as a percentage, the KPI represents the proportion of events that are correlated to a reduced number of incidents.
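The text does not give an exact formula, so the sketch below assumes one common reading: compression is the share of raw events that did not become separate incidents. The function name and the sample counts are illustrative.

```python
def compression_rate(event_count: int, incident_count: int) -> float:
    """Percentage by which correlation reduced raw events to incidents (assumed definition)."""
    if event_count <= 0:
        raise ValueError("event_count must be positive")
    if not 0 <= incident_count <= event_count:
        raise ValueError("incident_count must be between 0 and event_count")
    return 100 * (event_count - incident_count) / event_count


# Illustrative numbers: 1,000 raw events correlated into 150 incidents.
print(compression_rate(1000, 150))
```

Under this definition, correlating 1,000 events into 150 incidents is 85 percent compression.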
The goal of event correlation is to identify all events related to a single problem. There will be events that stem from the root problem and symptomatic events as the original failure impacts other components. Operators can address both the cause and symptoms when they fully understand the relationship.
Ideally, you would want a compression rate as close to 100 percent as possible. In reality, that goal is impossible to achieve because, as compression nears that level, it sacrifices accuracy. That means the software incorrectly groups events as stemming from the same issue or misses that one problem is related to another. Conversely, prioritizing accuracy depresses the compression rate.
Accuracy is not calculated by event correlation software. For example, Company A may have very different events than Company B, and what matters most to each company also varies. Therefore, a universal accuracy figure is next to impossible to calculate. Instead, accuracy is a soft, qualitative KPI that customers assess based on spot checks and business value evaluation.
Event correlation experts recommend, and I agree, that companies making this tradeoff strive for as high a compression rate as possible without sacrificing accuracy and value to the business. Typically, doing so yields a compression rate of around 70 to 85 percent. At the same time, higher compression rates of 85 percent or even 95 percent are not unheard of in some environments.
With analytics, event correlation software can give you insights into other event-driven metrics, which then help you make improvements in the effectiveness of your enterprise’s event management effort. To do so, you must look at raw event volumes and improvements resulting from deduplication and filtering. Evaluate enrichment statistics, signal-to-noise ratios and false-positive percentages. You can also look at event frequency in terms of the most common sources of hardware and software problems, so you can become more proactive in preventing issues.
Other metrics can be a byproduct of good event correlation. These metrics are typically found in IT Service Management, and are intended to evaluate how automated repairs, service teams, engineers and DevOps staff handle these incidents. Among these is a group of KPIs called MTTx because they all start with MTT, standing for “mean time to.” These include:
- MTTR – Mean Time to Respond, Mean Time to Recovery, Mean Time to Restore, Mean Time to Resolve, Mean Time to Repair: While these differ slightly in nuance, they all aim to describe how long an outage lasts and how long the problem takes to fix.
- MTTA – Mean Time to Acknowledge: The average amount of time that elapses from an alert being raised to an operator acknowledging it (before starting work to address it).
- MTTF – Mean Time to Failure: The average time a non-repairable product operates before it fails.
- MTTD – Mean Time to Detect: How much time is needed to discover an incident.
- MTBF – Mean Time between Failures: The average time between failures of a repairable system; a higher number indicates greater reliability.
- MTTK – Mean Time to Know: The average time required to discover an important issue.
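To make the MTTx metrics concrete, here is a minimal sketch that computes MTTA and one reading of MTTR (time from alert to resolution) from incident timestamps. The record fields `raised`, `acknowledged`, and `resolved` are assumed names, and the sample incidents are invented for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across all incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Invented sample incidents.
incidents = [
    {
        "raised": datetime(2024, 1, 5, 10, 0),
        "acknowledged": datetime(2024, 1, 5, 10, 4),
        "resolved": datetime(2024, 1, 5, 11, 0),
    },
    {
        "raised": datetime(2024, 1, 6, 14, 0),
        "acknowledged": datetime(2024, 1, 6, 14, 2),
        "resolved": datetime(2024, 1, 6, 14, 30),
    },
]

mtta = mean_minutes(incidents, "raised", "acknowledged")
mttr = mean_minutes(incidents, "raised", "resolved")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Because MTTR has several definitions, a real report should state which start and end points it uses.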
For event management metrics, look at raw event volume, then note the decreases in event volume through deduplication and filtering. For event enrichment statistics, use the percentage of alerts enriched and degree of enrichment, signal-to-noise ratio, or false-positive percentage. Specific event frequency is useful for identifying noise and improving actionability. Overall monitoring coverage, in terms of the percentage of incidents initiated by monitoring, is also valuable.
Sample MTBF report
Operationally, these numbers are meaningful. By monitoring and tracking these KPIs, you can gauge how well operations staff is performing. But the specific levels and changes from using event correlation software are unique to each organization. Event correlation software vendors, at best, can forecast a range of potential percentage improvement in MTTx metrics for a customer.
Industry use cases of event correlation
Event correlation platforms help enterprises’ operations teams achieve greater IT application and service reliability, a better experience for both internal and external customers, and stronger business outcomes such as increased revenues. Real examples across a variety of industries show how event correlation works.
U.S. Airline
A leading U.S. airline knew that even a minor service outage could cost millions of dollars in wasted fuel and lost revenue. In a bid to keep uptime high, the company used dozens of monitoring tools. But they were disjointed, and incident identification and resolution were manual processes. The carrier first rationalized and upgraded its monitoring tools, then implemented an AI-driven event correlation solution. The benefits included centralized monitoring, reduced incident escalations, and a drop in MTTR of 40 percent.
Athletic Shoe Manufacturer
This major athletic shoe and clothing maker was overwhelmed by alert data from its IT monitoring despite implementing some event correlation tools. By upgrading to a machine learning-based solution, the company dramatically improved its ability to identify critical incidents, act quickly, and perform accurate correlation. Within 30 days, its MTTA dropped from 30 minutes to one minute.
Enterprise Financial SaaS Provider
This enterprise software as a service provider struggled to resolve incidents using a level one service team, only succeeding with 5 percent of incidents. The company especially struggled when alert volumes surged 100-fold during Friday payroll processing. By implementing AI-based event correlation, the level one team increased its resolution rate by 400 percent, reduced MTTA by 95 percent, and MTTR by 58 percent in the first 30 days.
Home Improvement Retailer
A nationwide home improvement retailer suffered from extended outages at its stores because point-of-sale events were not being correlated. With event correlation, total outage duration dropped 65 percent, and average outage duration fell 46 percent. The company discovered that a heavy volume of alerts, including meaningless ones, hampered the network operations center (NOC) in identifying significant incidents and delayed their resolution. With a better event correlation solution, major incidents fell 27 percent, while root cause identification increased 226 percent and MTTR improved by 75 percent.
Event correlation approaches and techniques
Event correlation techniques focus on finding relationships in event data and identifying causation by looking at characteristics of events such as when they occurred, where they occurred, the processes involved, and the data type. AI-enhanced algorithms play a large role today in spotting those patterns and relationships as well as pinpointing the source of problems. Here’s an overview:
- Time-Based Event Correlation: This technique looks for relationships in the timing and sequence of events by examining what happened right before or at the same time as an event. You can set a time range or latency condition for correlation.
- Rule-Based Event Correlation: This approach compares events to a rule with specific values for variables such as transaction type or customer city. Because a new rule must be written for each value of a variable (New York, Boston, Atlanta, etc.), this approach can be time-consuming and unsustainable over the long term.
- Pattern-Based Event Correlation: This combines the time-based and rule-based techniques by looking for events that match a defined pattern without needing to specify values for each variable. Pattern-based is much less cumbersome than the rule-based technique but requires machine-learning enhancement of the event correlation tool. Machine learning helps the correlation tool continuously expand its knowledge of new patterns.
- Topology-Based Event Correlation: This approach is based on network topology, meaning the physical and logical arrangement of hardware such as servers and hubs, nodes on a network, and an understanding of how they’re connected to each other. By mapping events to the topology of affected nodes or applications, this technique makes it easier for users to visualize incidents in the context of their topology.
- Domain-Based Event Correlation: This technique takes event data from monitoring systems that focus on an aspect of IT operations (network performance, application performance, and computing infrastructure) and correlates the events. Some event correlation tools ingest data from all monitoring tools and do cross-domain or domain-agnostic event correlation.
- History-Based Event Correlation: This method compares new events to historical events to see if they match. In this way, a history-based correlation is similar to pattern-based. However, history-based correlation is “dumb” in that it can only connect events by comparing them to identical events in the past. Pattern-based is flexible and evolving.
- Codebook Event Correlation: This technique codes events and alarms into a matrix, and maps events to alarms. A unique code based on this mapping represents issues. Events then can be correlated by seeing if they match the code.
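Of the techniques above, time-based correlation is the simplest to sketch. The example below groups events into a cluster whenever each one arrives within a fixed window of the previous event. The five-minute window and the `at` and `source` fields are assumptions for illustration; production tools typically combine timing with topology, rules, and learned patterns.

```python
from datetime import datetime, timedelta


def correlate_by_time(events, window=timedelta(minutes=5)):
    """Group events so each one falls within `window` of the previous event in its cluster."""
    clusters = []
    for event in sorted(events, key=lambda e: e["at"]):
        if clusters and event["at"] - clusters[-1][-1]["at"] <= window:
            clusters[-1].append(event)  # close enough in time: same cluster
        else:
            clusters.append([event])  # gap too large: start a new cluster
    return clusters


# Invented sample events.
events = [
    {"source": "db-01", "at": datetime(2024, 1, 5, 10, 0)},
    {"source": "web-02", "at": datetime(2024, 1, 5, 10, 2)},
    {"source": "web-03", "at": datetime(2024, 1, 5, 10, 6)},
    {"source": "cache-01", "at": datetime(2024, 1, 5, 10, 30)},
]

clusters = correlate_by_time(events)
print([[event["source"] for event in cluster] for cluster in clusters])
```

The first three events chain together because each follows the previous one within five minutes, while the last event starts its own cluster.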
Importance of event correlation
Businesses depend on IT systems to serve customers and generate revenue. So IT issues threaten efficiency, customer service, and profitability. That makes event correlation a critical tool to support performance because the practice increases reliability and decreases problems and outages.
The stakes are high. A 2016 report from the Ponemon Institute put the average cost of enterprise server downtime at about $9,000 a minute. A 14-hour outage on Facebook and related apps in 2019 cost an estimated $90 million in lost revenue.
The rationale for event correlation software comprises these high-level proof points:
- Lower operational costs
- Faster response times
- Better time management and efficiency
- Improved SLA achievement
- Greater real-time visibility of events
- Reduced manual work and errors in correlation
- Enhanced customer satisfaction
- Smarter responses to events
- Higher customer satisfaction and loyalty
- Increased revenue
Benefits of event correlation software
The right event correlation software improves an organization’s resilience and moves it out of a firefighting, purely reactive mindset. Additional downstream benefits include automating key processes, faster resolution, and smarter root-cause analysis.
Best-in-class event correlation tools ingest event data, perform deduplication, distinguish significant events from noise, analyze root causes, and prioritize IT response based on business objectives. The ability to collect all types of data from all sources helps IT teams break through silos that limit their ability to see the full picture.
By choosing the right platform, you can experience a quick and easy setup. In this scenario, a small internal or vendor team can establish integrations in days rather than weeks or months, without needing support from experts. Other important benefits are the ability to fit into your existing ecosystem and aggregate data from any monitoring system, extracting more value from existing investments. The system can also provide stronger regulatory compliance through easier reporting on network privacy and security.
AI-driven event correlation
The rise of artificial intelligence has made event correlation more powerful by harnessing machine learning and deep learning. With these, exposing the system to large amounts of data improves correlation algorithms automatically rather than requiring algorithms to be trained manually, over many months or even years.
First, let’s quickly recap these concepts. The idea of AI has been around since the mid-20th century and seeks to make computers emulate human reasoning. Machine learning was the first real incarnation of AI.
In machine learning, computers act more like people in that they learn on their own by interacting with information. Previously, a coder had to program an instruction for every situation the computer encountered. The goal in machine learning is for the system to apply its new knowledge to unfamiliar data and interpret it.
Deep learning goes further by creating processing systems structured like the neural networks of the brain, in many layers (thus the deep description). Each layer refines its understanding and interpretation of the data. This process requires vast amounts of data and a lot of computing resources to work. So deep learning is a more recent innovation.
The objective is to apply these techniques to problems that only humans have been able to solve. Examples include identifying the emotions a person is feeling from their facial expression or word choice, and driving a car on a busy street with unpredictable hazards.
Event correlation in AIOps
Event correlation has tracked this evolution. Initially, it was a process dependent on human coders and engineers, but that changed around 2010. The introduction of statistical analysis and visualization was the first major advancement in event correlation.
More recently, machine learning and deep learning have given event correlation solutions the power to learn from event data to automatically generate new correlation patterns. This marked the application of artificial intelligence to event correlation.
Gartner analysts coined the category term AIOps in 2016. Use cases for AIOps span more than event correlation and include big data management and anomaly detection. In 2018, Gartner outlined the key functions of an AIOps platform as:
- Ingesting data from multiple sources regardless of type or vendor
- Executing real-time analysis at the point of ingestion
- Performing historical analysis of stored data
- Making use of machine learning
- Initiating an action or next step based on analytics
In event correlation, AIOps is the capability to process a flood of incident alerts, analyze them rapidly, uncover insights and detect incidents as they start to form, before they escalate into crippling outages.
One of the biggest challenges in AI is the so-called black box effect. In this effect, a lack of transparency around algorithms and machine learning instructions breeds distrust among users and slows adoption. AIOps event correlation tools can overcome this challenge by providing transparency, testability and control. The software can provide users with visibility into the correlation patterns, as well as the ability to create or modify these patterns, and then test them before deploying them into production.
As AIOps matures, event correlation will offer pattern-based prediction and the detection of root causes that people miss. AI-driven event correlation solutions will simply plug in and tell incident managers what to do and how to do it.
Analyst recommendations on AIOps
In a November 2019 market report, Gartner Research estimated the AIOps platform market at $300 million to $500 million a year. They predicted that by 2023, 40 percent of DevOps teams would add AIOps platforms to their toolset.
Gartner’s analysts recommended that companies take an incremental approach to AIOps. They should start with less critical applications and event categorization, correlation, and anomaly detection. Over time, they can use tools to reduce false alarms, test the value of patterns, become proactive in preventing impact, decrease outage duration, and improve IT service management.
GigaOm’s 2020 analysis of the AIOps landscape found that tools on the market span a spectrum of AI adoption and predicted a consolidation of players. Many event correlation tools have bolted on AI as an afterthought, GigaOm reported. So, enterprises need to study the offerings carefully and clearly understand their capabilities, including compatibility, which has been problematic for some tools. Another differentiator is whether to choose a cloud-native, on-demand model or on-premises.
Common misconceptions about event correlation
As GigaOm noted, there are many differences among tools that fall under the event correlation umbrella. But some misconceptions are common across the board. Two of these relate to real-time processing and anomaly detection.
- Real-Time Processing: Many customers believe that machine learning equips event correlation solutions to perform real-time processing and correlation on novel events. However, this capability is not offered by any vendor because it requires advances in AI and massively more computing power than is practicable.
- Anomaly Detection: Users are also often confused about how anomaly detection relates to event correlation. Anomaly detection is a function of monitoring and observability tools that looks at a single, isolated metric such as CPU load over time, and can detect when this metric enters an anomalous state (e.g. the baseline for CPU load = 30%; if the CPU load were to hit 70%, then it would be considered anomalous compared to the baseline.) When monitoring and observability tools detect anomalies, they generate an event pointing out that an anomaly was observed. This output is one of the data streams fed into the event correlation engine. No event correlation solution currently available also performs anomaly detection.
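The CPU example above can be sketched as a simple baseline check that emits an event whenever a single metric strays too far from its baseline. The 30 percent baseline comes from the example; the tolerance of 25 points, the sample readings, and the event fields are assumptions, and real monitoring tools generally learn baselines statistically rather than using a fixed value.

```python
def detect_anomalies(samples, baseline=30.0, tolerance=25.0, metric="cpu_load"):
    """Emit one event for each sample that deviates from the baseline by more than tolerance."""
    events = []
    for timestamp, value in samples:
        if abs(value - baseline) > tolerance:
            events.append(
                {"at": timestamp, "metric": metric, "value": value, "type": "anomaly"}
            )
    return events


# Invented CPU load readings, in percent.
samples = [("10:00", 28.0), ("10:01", 33.0), ("10:02", 70.0)]

anomaly_events = detect_anomalies(samples)
print(anomaly_events)
```

The emitted events are what a monitoring tool would feed into the event correlation engine as one of its input streams.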
How to pick the right event correlation tool for your company
The right event correlation solution can enable your IT Ops efforts to deliver better business value. But competing claims and opaque technology can make it hard to know which tool is the best match for your needs.
Here is a checklist of key features and functions to consider. Use the free downloadable evaluation scorecard to compare vendors on different dimensions, weighted by importance to your company.
Checklist of key event correlation features
Evaluate the solutions on user experience, functionality, platform architecture, integration partners, vendor service, and strategic alignment. Here are some of the highlights for each, with a full list in the downloadable scorecard template.
- User Experience
- Security and convenience of access
- Intuitive navigation
- Modern, understandable user experience
- Unified console – Single pane of glass for incident resolution
- Native analytics – easy to set up and understand
- Third-party analytics – easy to integrate with best in class BI solutions
- No expertise required
- Data sources ingested
- Data stream hosting platform
- Types of events correlated
- Topology tools
- Ingest sources and formats
- Data interpretation and enhancement
- Root cause detection
- Correlation methods used
- Fully automated root cause analysis
- Ability to drill into root cause changes
- Ability to visualize incidents in context of topology / environment
- Ability to visualize events in context of incident timeline
- Machine Learning/AI
- Level-0 automation – automation of manual tasks
- Agile friendly
- Integration technology and the ability to integrate all event feeds from existing tools (protect investment)
- Integration technology and the ability to integrate all event feeds without consultants (cost)
- Integration technology and the ability to integrate all event feeds quickly (time)
- Strategic Factors
- Vision Alignment
- Roadmap Alignment
- Business Model Alignment
- Company Culture
- Industry Strength
- Financial stability
- Customer satisfaction
- Integration – Observability / Monitoring tools vendors
- Integration – Topology / Change tools vendors
- Integration – Collaboration tools vendors
- Systems integrators
- Resellers (Value Added)
- Cloud provider
- Proof of Value (POV)
- Implementation – deployment timeline estimate
- Customer success program
- Customer support SLA
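One way to compare vendors on dimensions weighted by importance is a weighted average, sketched below. The weights, the 1-to-5 scores, and the vendor names are invented for illustration and are not taken from the scorecard template.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores, using importance weights."""
    total_weight = sum(weights.values())
    return sum(scores[dimension] * weight for dimension, weight in weights.items()) / total_weight


# Hypothetical importance weights for the six evaluation dimensions.
weights = {
    "user_experience": 2,
    "functionality": 4,
    "platform_architecture": 3,
    "integration_partners": 3,
    "vendor_service": 1,
    "strategic_alignment": 2,
}

# Hypothetical scores on a 1-to-5 scale.
vendors = {
    "vendor_a": {
        "user_experience": 4, "functionality": 3, "platform_architecture": 4,
        "integration_partners": 5, "vendor_service": 2, "strategic_alignment": 3,
    },
    "vendor_b": {
        "user_experience": 3, "functionality": 5, "platform_architecture": 4,
        "integration_partners": 3, "vendor_service": 4, "strategic_alignment": 4,
    },
}

totals = {name: weighted_score(scores, weights) for name, scores in vendors.items()}
best = max(totals, key=totals.get)
print(totals, best)
```

Adjusting the weights to reflect what matters most to your company can change which vendor comes out ahead, which is the point of weighting.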
Why top enterprises choose BigPanda for event correlation
BigPanda is a best-in-class event correlation platform powered by AIOps. Identified by GigaOm as among the strongest AIOps players, the platform helps modern enterprises reduce IT noise by 95%+, lets you detect incidents in real time, as they form and before they escalate into outages, and empowers your IT Ops team to focus on high-value work. IT Ops teams that use BigPanda can improve the performance and availability of their critical applications and services, lower their operating costs and accelerate their business velocity.
Compared to alternatives, BigPanda offers quicker and easier implementation (most customers go live in under 12 weeks), faster incident detection thanks to pre-trained ML models, and compatibility with all of your existing monitoring, change, topology, collaboration, ticketing and chat tools. The result is more value from your existing IT investments, lower total cost of ownership, and shorter time to value, measured in weeks not months.
Moreover, BigPanda’s unique Open Box Machine Learning technology offers unparalleled transparency of your correlation logic, and the ability to edit and test this logic before deploying it into production. While other vendors obscure their AI/ML logic, BigPanda users gain confidence by being in control at all times.