Event types and use cases for event correlation
As organizations grow and become more complex, so does the need to monitor and troubleshoot issues across the entire IT infrastructure. Event correlation is a powerful technique that can help make sense of the huge volume of alert data generated by monitoring systems and identify problems as they occur. In this blog, we’ll look at event types, use cases for event correlation and approaches that organizations can use to get the most out of this valuable tool.
Event types in event correlation
There are several event types that are commonly used in event correlation. These include:
- System events: These are generated by the operating system and can include startup and shutdown events, process creation and termination events and network connection events.
- Application events: These are generated by applications and can include user login and logout events, file access events and database query events.
- Security events: These are generated by security-related systems and can include intrusion detection events, malware detection events and firewall events.
- Database events: These are generated by database systems and can include database access events, database update events and database query events.
- Web server events: These are generated by web servers and can include HTTP request events, HTTP error events and HTTP session events.
- Network events: These are generated by network devices such as routers and switches and can include interface up/down events, link flapping events and border gateway protocol (BGP) neighbor events.
- Other events: There can be other types of events that are not covered by the above categories. These can include environmental events (temperature, humidity, etc.), equipment events (upset conditions, maintenance needed, etc.) and business events (sales order placed, customer service call received, etc.).
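Before events from such different sources can be correlated, they usually need to be mapped onto a common shape. The following is a minimal sketch of what that might look like in Python. The `Event` schema, the `EventType` names and the raw payload keys (`type`, `source`, `message`, `ts`) are all assumptions made for illustration and do not reflect any particular product's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventType(Enum):
    """Event categories from the list above (names are illustrative)."""
    SYSTEM = "system"
    APPLICATION = "application"
    SECURITY = "security"
    DATABASE = "database"
    WEB_SERVER = "web_server"
    NETWORK = "network"
    OTHER = "other"


@dataclass
class Event:
    """A normalized event record (hypothetical schema)."""
    event_type: EventType
    source: str            # host, device or service that emitted the event
    message: str
    timestamp: datetime
    attributes: dict = field(default_factory=dict)


def normalize(raw: dict) -> Event:
    """Map a raw monitoring payload onto the common Event shape.

    Unknown or missing categories fall back to EventType.OTHER.
    """
    try:
        event_type = EventType(raw.get("type", "other"))
    except ValueError:
        event_type = EventType.OTHER
    reserved = {"type", "source", "message", "ts"}
    return Event(
        event_type=event_type,
        source=raw["source"],
        message=raw.get("message", ""),
        timestamp=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        # Everything else is kept as free-form attributes.
        attributes={k: v for k, v in raw.items() if k not in reserved},
    )


event = normalize({
    "type": "network",
    "source": "router-1",
    "message": "BGP neighbor down",
    "ts": 1700000000,
    "peer": "10.0.0.2",
})
print(event.event_type.name, event.source, event.attributes)
```

Keeping source-specific fields in a free-form `attributes` dictionary is one possible design choice: it lets new event sources be added without changing the shared schema.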
Event correlation KPIs
There are several key performance indicators (KPIs) that can be used to measure the effectiveness of event correlation. These include:
- Compression rate: This is the percentage by which the event correlation software reduces event volume by grouping related events into a smaller number of incidents. A higher compression rate indicates that the software is effectively identifying and grouping together related events.
- Accuracy: This is a measure of how often the event correlation software correctly identifies and groups together related events. A higher accuracy rate indicates that the software is more effective at identifying and grouping together related events.
- Enrichment statistics: This is a measure of how often event correlation software successfully adds information to events, such as contextual information or metadata. A higher enrichment rate indicates that the software is more effective at giving responders the context they need.
- Signal-to-noise ratio: This is a measure of how many processed events are actionable (signal) compared with how many are redundant or irrelevant (noise). A higher signal-to-noise ratio indicates that the software is more effective at surfacing the events that matter and suppressing the rest.
- False-positive rate: This is a measure of how often the event correlation software incorrectly groups together unrelated events. A lower false-positive rate indicates that the software is more effective at correctly grouping together related events.
- Event frequency: This is a measure of how often events occur. Event frequency can be used to identify the most common sources of hardware and software problems in order to become more proactive in preventing issues.
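Some of these KPIs can be computed directly from event and incident counts. The sketch below shows one possible way to do so. The formulas are assumptions: compression rate is taken as the percentage reduction from raw events to incidents, and enrichment rate as the share of events carrying a non-empty `enrichment` field. The event dictionaries and their keys are hypothetical.

```python
from collections import Counter


def compression_rate(total_events: int, total_incidents: int) -> float:
    """Percentage reduction from raw events to correlated incidents."""
    if total_events == 0:
        return 0.0
    return 100.0 * (total_events - total_incidents) / total_events


def enrichment_rate(events: list[dict]) -> float:
    """Percentage of events that carry at least one enrichment field."""
    if not events:
        return 0.0
    enriched = sum(1 for e in events if e.get("enrichment"))
    return 100.0 * enriched / len(events)


def event_frequency(events: list[dict]) -> list[tuple[str, int]]:
    """Event counts per source, most frequent first."""
    return Counter(e["source"] for e in events).most_common()


sample_events = [
    {"source": "db-1", "enrichment": {"owner": "dba-team"}},
    {"source": "db-1"},
    {"source": "web-1", "enrichment": {}},   # empty enrichment counts as none
    {"source": "db-1", "enrichment": {"owner": "dba-team"}},
]

print(compression_rate(total_events=1000, total_incidents=50))
print(enrichment_rate(sample_events))
print(event_frequency(sample_events))
```

Accuracy and false-positive rate are left out of this sketch because they require labeled data, that is, a record of which groupings were actually correct.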
Other metrics evaluate how incident responders, service teams, engineers and DevOps staff handle problems. These metrics come from IT service management (ITSM) and include the following:
- Mean time to resolve (MTTR): The average amount of time it takes for an incident response team to resolve an issue.
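- Mean time to repair (MTTR): The average amount of time it takes for a service team to repair an issue. Because this metric shares its abbreviation with mean time to resolve, state which one is meant when reporting MTTR.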
- Mean time to detect (MTTD): The average amount of time it takes for engineers or DevOps staff to detect an issue.
- Mean time between failures (MTBF): The average amount of time that passes between incidents.
- Mean time to know (MTTK): The average amount of time between an issue being detected and an incident response team identifying its cause.
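As a rough illustration, several of these averages can be derived from incident timestamps. The sketch below assumes each incident record has `started`, `detected` and `resolved` timestamps. These field names, and the choice to measure MTTR from detection to resolution and MTBF from one resolution to the next start, are assumptions, since teams define the intervals differently.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )


def mtbf_minutes(incidents):
    """Average minutes from one incident's resolution to the next one's start."""
    ordered = sorted(incidents, key=lambda i: i["started"])
    gaps = [
        (nxt["started"] - prev["resolved"]).total_seconds() / 60
        for prev, nxt in zip(ordered, ordered[1:])
    ]
    return mean(gaps) if gaps else None  # undefined with fewer than two incidents


t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    {"started": t0,
     "detected": t0 + timedelta(minutes=5),
     "resolved": t0 + timedelta(minutes=60)},
    {"started": t0 + timedelta(minutes=180),
     "detected": t0 + timedelta(minutes=195),
     "resolved": t0 + timedelta(minutes=220)},
]

mttd = mean_minutes(incidents, "started", "detected")    # time to detect
mttr = mean_minutes(incidents, "detected", "resolved")   # time to resolve
mtbf = mtbf_minutes(incidents)
print(f"MTTD={mttd} min, MTTR={mttr} min, MTBF={mtbf} min")
```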
BigPanda helps organizations improve performance and availability of their critical business applications and systems, reducing MTTR by 50% or more.
Industry use cases of event correlation
Event correlation platforms help organizations reduce IT operations cost, improve service availability and reliability and accelerate digital transformation. By improving operations, event correlation can lead to increased revenues and a better overall experience for customers. Below are some specific examples of how event correlation is used in different industries:
Airline industry
A prominent U.S. airline knew that even a brief service outage could cost millions of dollars in wasted fuel and lost profit. To maintain high uptime, the company used several different monitoring tools. However, these tools were not connected, and both incident identification and resolution relied on manual processes. The carrier began by streamlining and upgrading its monitoring tools and then implemented an event correlation solution driven by artificial intelligence (AI). The benefits included centralized monitoring, fewer incident escalations and a 40% decrease in MTTR.
CPG manufacturing industry
Despite having event correlation tools in place, a large athletic shoe and clothing manufacturer was feeling overwhelmed by the alert data from its IT monitoring. By upgrading to a machine learning-based solution, the company was able to vastly improve its ability to identify critical incidents so it could act quickly and perform accurate correlations. Within just 30 days, its mean time to acknowledge (MTTA) had decreased from 30 minutes to one minute.
Enterprise financial SaaS provider
An enterprise software as a service (SaaS) provider was having difficulty resolving incidents with its level one service team, which managed to resolve only 5% of incidents. The company experienced particular difficulties when alert volumes increased 100x during Friday payroll processing. However, by implementing AI-based event correlation, the level one team saw its resolution rate increase by 400%, its MTTA decrease by 95% and its MTTR decrease by 58% within the first 30 days.
Event correlation approaches and techniques
Some of the most common approaches and techniques include:
- Time-based analysis: This approach looks at when events occur in relation to one another. For example, if two events happen close together in time, they may be related.
- Rules-based event correlation: This approach uses predetermined rules to identify incidents. The rules are typically created by experts such as system administrators or engineers and can be based on factors such as alert keywords, timeframes or event sequences.
- Topology-based event correlation: This approach looks at the relationships between different nodes in a system to identify incidents. For example, if two servers are connected and one goes down, the other may be affected as well.
- History-based event correlation: This approach uses information from past incidents to identify current and future incidents. For example, if incoming events match the pattern of an incident that was previously logged and resolved, the same cause and fix are likely to apply.
- Codebook event correlation: This approach uses a codebook, which is a stored mapping of known incidents to the event patterns they produce, to identify and resolve current incidents. The codebook can be created manually or automatically through machine learning (ML).
- Pattern-based event correlation: This approach uses machine learning to identify patterns in data and then looks for similar patterns in new data. This allows for the automatic identification of incidents without the need for predetermined rules or a codebook.
- Domain-specific event correlation: This approach uses domain-specific knowledge to identify incidents. For example, in the IT industry, an administrator may be familiar with the different types of events that can occur on a server and how they are typically related.
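To make two of these approaches concrete, here is a minimal sketch of time-based and topology-based grouping. It is a simplified illustration, not how any specific event correlation product works. The event dictionaries, the five-minute default window and the `links` list describing the topology are all assumptions.

```python
from datetime import datetime, timedelta


def correlate_by_time(events, window=timedelta(minutes=5)):
    """Group events whose gap to the previous event is within `window`."""
    groups = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if groups and event["timestamp"] - groups[-1][-1]["timestamp"] <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def correlate_by_topology(events, links):
    """Group events whose source nodes are connected in the topology.

    `links` is a list of (node_a, node_b) pairs. Connectivity is resolved
    with a small union-find, so indirectly linked nodes share a group.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for event in events:
        groups.setdefault(find(event["source"]), []).append(event)
    return list(groups.values())


t0 = datetime(2024, 1, 1, 12, 0)
events = [
    {"source": "switch-1", "timestamp": t0, "message": "interface down"},
    {"source": "web-1", "timestamp": t0 + timedelta(minutes=1), "message": "HTTP 503"},
    {"source": "db-9", "timestamp": t0 + timedelta(minutes=2), "message": "slow query"},
    {"source": "web-1", "timestamp": t0 + timedelta(minutes=30), "message": "HTTP 503"},
]
links = [("switch-1", "web-1")]  # hypothetical topology: web-1 sits behind switch-1

time_groups = correlate_by_time(events)
topology_groups = correlate_by_topology(events, links)
print([len(g) for g in time_groups], [len(g) for g in topology_groups])
```

The two functions group the same events differently: time-based grouping pulls in the unrelated `db-9` event because it happened nearby in time, while topology-based grouping links the late `web-1` event to the switch even though it occurred half an hour later. In practice the approaches are often combined, for example by applying the topology check within each time window.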
The right approach for your organization will depend on a number of factors, such as the size and complexity of your system, the amount of data you have and the resources you have available.
Remember, no matter what approach you use, event correlation is an important part of any incident management strategy. By identifying and resolving incidents quickly, you can minimize the impact on your organization and keep your customers happy.