Event Correlation, powered by AIOps

What is Event Correlation?

The typical enterprise has invested in 15 or more observability and monitoring tools. These tools provide their IT Ops, NOC, DevOps and SRE teams deep visibility into critical applications, systems and infrastructure, both on-prem and in the cloud.

But these tools also generate very large volumes of IT alerts that, combined with topology and change data streams, overwhelm those teams. Because they’re overwhelmed, these teams can’t quickly and easily detect IT incidents until it’s too late and these incidents have escalated into crippling outages.

BigPanda’s Event Correlation, built for complex, modern IT environments

BigPanda’s Event Correlation aggregates and correlates disparate streams of observability, monitoring, change and topology data into context-rich incidents.

Using 50+ out-of-the-box integrations and powerful REST APIs, BigPanda connects to existing observability and monitoring tools and aggregates their data in real time. To date, BigPanda has integrated with 300+ unique tools. In addition, BigPanda’s SNMP agent collects alerts (SNMP traps) from tens of thousands of IT systems and devices. The system normalizes the data into a consistent format and adds context by bringing in topology and operational data.

Using Open Box Machine Learning, BigPanda then correlates the collected alert and topology data into a handful of context-rich incidents, dramatically reducing the noise. By helping teams detect and act on incidents in real time, as they form, BigPanda prevents those incidents from escalating into outages.

The challenges with correlating events today

Disparate data, manual aggregation

For enterprises, critical operational data (e.g. monitoring, change and topology) is distributed across dozens of siloed tools. Without a solution in place that can aggregate this data, teams are forced to constantly switch between those tools, and make sense of it manually.


Inability to make sense of the data without experts

Different tools use highly distinct formats and terminology to describe the same IT components. This makes it very hard for operators to consume their data in a consistent manner, and even harder to glean valuable insight from this data.

Lack of cross-stack visibility and context

Because monitoring tools are siloed from each other, it is difficult to connect the metadata from one data stream (such as location, store, datacenter or business service) with the metadata of other data streams. This inability to connect the dots limits visibility into the scope of incidents and outages, and into their root cause. The result: costly and frustrating human interactions as different operators and team members try to decide what’s impacted and what to focus on next.


Massive noise

With 15+ siloed monitoring tools generating tens or hundreds of thousands of alerts each day, critical incidents are often too hard to spot in the sea of data exhaust. Teams often learn of an outage only when frustrated customers, users, service owners and business units start to complain.

Inability to prioritize incidents or understand their impact

Most enterprises experience thousands of incidents every week. Operations teams must be able to easily understand the impact of these incidents and prioritize their response accordingly, before users and customers are affected. But because enterprises are drowning in IT noise, and because operational and business context is sorely lacking, this is very hard to do.


Manual reporting and analysis of IT operational performance

To report on IT Ops data, many enterprises rely on error-prone, manually updated spreadsheets or general-purpose reporting tools that require extensive customization. Others rely on homegrown or custom IT Ops reporting tools that are expensive and time-consuming to build and maintain. This makes it very hard for enterprises to track, measure and subsequently improve critical IT Ops KPIs and metrics.

Using 50+ out-of-the-box integrations and powerful REST APIs for monitoring alerts, changes and topology, BigPanda can collect and aggregate data from all monitoring, change and topology tools in real time.

How BigPanda’s Event Correlation works

Normalizing data into a single and consistent format

BigPanda translates diverse IT data sets (such as alerts, changes and topology) into one consistent taxonomy, represented using general-purpose key-value pairs called tags. BigPanda performs this in real time using multiple out-of-the-box and custom normalization methods.
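
To make the idea of tags concrete, here is a minimal Python sketch of normalization. It is not BigPanda’s implementation: the tool names (`tool_a`, `tool_b`), the field mappings and the status vocabulary are all hypothetical, chosen only to show how two differently shaped alerts can end up with the same key-value tags.

```python
# Hypothetical field mappings: each source tool's field names -> shared tag names.
FIELD_MAPS = {
    "tool_a": {"hostname": "host", "alertname": "check", "state": "status"},
    "tool_b": {"device": "host", "monitor": "check", "level": "status"},
}

# Hypothetical status vocabulary: tool-specific values -> one shared set.
STATUS_MAP = {
    "crit": "critical", "critical": "critical", "down": "critical",
    "warn": "warning", "warning": "warning",
    "ok": "ok", "up": "ok",
}


def normalize(source: str, raw: dict) -> dict:
    """Translate one raw alert into a flat dict of lowercase tags."""
    mapping = FIELD_MAPS[source]
    tags = {"source": source}
    for field, value in raw.items():
        tag = mapping.get(field, field)  # unmapped fields keep their original name
        tags[tag] = str(value).strip().lower()
    if "status" in tags:
        tags["status"] = STATUS_MAP.get(tags["status"], tags["status"])
    return tags


# Two tools describing the same problem in different terms.
a = normalize("tool_a", {"hostname": "DB-01", "alertname": "CPU", "state": "CRIT"})
b = normalize("tool_b", {"device": "db-01", "monitor": "cpu", "level": "Down"})
```

After normalization, both alerts carry identical `host`, `check` and `status` tags, which is what makes later enrichment and correlation possible.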

Enriching monitoring alerts with operational and topology data

BigPanda’s out-of-the-box integrations and REST API let teams collect contextual data from all sources of operational and topology data, including CMDBs, asset and topology sources, infrastructure-as-code topology sources, APM and network maps, and custom asset and process inventories. Once collected, this data is used to enrich your monitoring alerts. BigPanda’s native enrichment capabilities are robust and highly scalable: they can enrich millions of alerts every day from sources that themselves hold millions of records. This is especially important for enterprises whose application topology data is scattered across several sources, as is the case with companies in the middle of application modernization and/or cloud migration.
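
As a rough illustration of enrichment, the Python sketch below joins a normalized alert with a lookup table. The `CMDB` dictionary, its keys and the `enrich` helper are hypothetical stand-ins for whatever operational or topology source an enterprise uses; they do not reflect BigPanda’s actual enrichment engine or API.

```python
# Hypothetical CMDB extract keyed by host name.
CMDB = {
    "db-01": {"service": "payments", "datacenter": "us-east", "owner": "dba-team"},
    "web-07": {"service": "storefront", "datacenter": "eu-west", "owner": "web-team"},
}


def enrich(alert: dict, cmdb: dict) -> dict:
    """Return a copy of the alert with context tags added from the CMDB."""
    enriched = dict(alert)  # leave the original alert untouched
    context = cmdb.get(alert.get("host"), {})
    for tag, value in context.items():
        enriched.setdefault(tag, value)  # never overwrite tags the alert already has
    enriched["enriched"] = bool(context)  # record whether a match was found
    return enriched


alert = {"host": "db-01", "check": "cpu", "status": "critical"}
result = enrich(alert, CMDB)
unknown = enrich({"host": "cache-99", "check": "ping", "status": "critical"}, CMDB)
```

Recording whether a lookup matched (the `enriched` tag here) is one simple way to track how much of the alert stream actually receives context.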

Reducing noise without manual effort

BigPanda uses Open Box Machine Learning to correlate alerts, changes and topology data, reducing IT noise by 95%+.

No longer distracted by IT noise, informational events or false positives, IT operations teams can catch evolving incidents as they happen, before they escalate into crippling outages that affect the business.

To tailor BigPanda’s Machine Learning to each enterprise’s unique needs and maximize correlation efficiency, BigPanda also gives users unprecedented control over its Machine Learning logic. Users can see the logic in plain English, edit it to incorporate their tribal knowledge, and test and preview results before deploying it into production. Open Box Machine Learning also allows enterprises to adopt AI/ML pragmatically and benefit from automation.
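
The kind of logic a user might read in plain English, such as “alerts on the same service within 10 minutes of each other belong to one incident”, can be sketched in Python as follows. This is a fixed, hand-written rule used purely for illustration; it is not BigPanda’s Machine Learning logic, and the `Incident` class, the `correlate` function, the `service` tag and the 600-second window are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: str          # shared tag value, e.g. the service name
    first_seen: int   # timestamp (seconds) of the first alert
    last_seen: int    # timestamp (seconds) of the latest alert
    alerts: list = field(default_factory=list)


def correlate(alerts: list, tag: str = "service", window: int = 600) -> list:
    """Group alerts that share `tag` and arrive within `window` seconds
    of the previous alert in the same group."""
    incidents = []
    open_by_key = {}  # tag value -> most recent incident for that value
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = alert.get(tag, "unknown")
        current = open_by_key.get(key)
        if current is None or alert["ts"] - current.last_seen > window:
            # No incident yet for this key, or the last one has gone quiet.
            current = Incident(key, alert["ts"], alert["ts"])
            incidents.append(current)
            open_by_key[key] = current
        current.alerts.append(alert)
        current.last_seen = alert["ts"]
    return incidents


alerts = [
    {"ts": 0, "host": "db-01", "service": "payments"},
    {"ts": 120, "host": "db-02", "service": "payments"},
    {"ts": 200, "host": "web-07", "service": "storefront"},
    {"ts": 300, "host": "api-03", "service": "payments"},
    {"ts": 5000, "host": "db-01", "service": "payments"},
]
incidents = correlate(alerts)
```

Here five alerts collapse into three incidents: the first three `payments` alerts form one incident, the `storefront` alert forms its own, and the late `payments` alert opens a new incident because it falls outside the window. Being able to preview a result like this before deployment is the point of keeping the logic readable and editable.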

Easy impact analysis and prioritization

BigPanda’s Operations Console and Real-time Topology Mesh make it easy for operations teams to understand the impact of incidents and prioritize their response. The console uses environments to provide cross-stack views that filter by severity. It also displays business context, such as affected services and potential customer impact, for each incident in the intuitive incident view, and it supports practices such as Inbox Zero to help operators prioritize their “to-do lists” for incidents. Finally, the Real-time Topology Mesh helps users understand dependencies between apps/services and low-level infrastructure, so they can determine how best to prioritize their responses to incidents.

Out-of-the-box reporting for IT Operations

BigPanda Unified Analytics is purpose-built from the ground up for IT Ops. It packages BigPanda’s domain-specific reporting and analytics experience, derived from years of helping the largest and most complex enterprises in the world report on their IT Ops data. It is based on a robust, fully customizable and scalable reporting and visualization back end that can handle the volume of IT Ops data those enterprises generate.

BigPanda’s Unified Analytics provides enterprises with a library of out-of-the-box dashboards that measure, track and display commonly used IT Ops Key Performance Indicators (KPIs), metrics and trends in accordance with industry best practices. These include KPIs and metrics such as Compression and Noise Reduction Ratios, Impacted Applications, MTTx by Severity and Category, Team Performance, Top N Hosts, Top N Applications, Enrichment Rates, Recurring Incidents and more. BigPanda Unified Analytics supports all widely used business intelligence and data warehousing platforms.
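
For readers unfamiliar with these KPIs, the Python sketch below shows one plausible way to compute two of them. The formulas are assumptions made for illustration (noise reduction as the share of alerts that did not become a separate incident, and mean time to resolve averaged per severity); BigPanda’s dashboards may define them differently, and the function and field names are hypothetical.

```python
from collections import defaultdict


def noise_reduction(alert_count: int, incident_count: int) -> float:
    """Share of alerts that did not become a separate incident (assumed definition)."""
    if alert_count == 0:
        return 0.0
    return 1 - incident_count / alert_count


def mean_time_to_resolve(incidents: list) -> dict:
    """Average minutes from open to resolve, grouped by severity."""
    durations = defaultdict(list)
    for inc in incidents:
        durations[inc["severity"]].append(inc["resolved_min"] - inc["opened_min"])
    return {sev: sum(vals) / len(vals) for sev, vals in durations.items()}


# Made-up sample data.
incidents = [
    {"severity": "critical", "opened_min": 0, "resolved_min": 30},
    {"severity": "critical", "opened_min": 10, "resolved_min": 60},
    {"severity": "warning", "opened_min": 5, "resolved_min": 125},
]
ratio = noise_reduction(alert_count=2000, incident_count=100)
mttr = mean_time_to_resolve(incidents)
```

With these sample numbers, 2,000 alerts reduced to 100 incidents gives a noise reduction of 95%, and the mean time to resolve comes out at 40 minutes for critical incidents and 120 minutes for warnings.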

Building the business case for BigPanda Event Correlation

To build a business case, quantify the negative consequences of having to manually sift through events to understand incidents. Here are some areas enterprises should examine and quantify in their environment:

  • How many FTEs do your IT Ops or NOC teams have (across all shifts and global locations)? What are their salaries and fully loaded costs? This helps you understand the value of each minute they spend on IT Ops alerts, incidents and outages when you don’t have Event Correlation (EC) in place.
  • Of the IT Ops alerts you collect, what percentage is just noise (non-actionable or merely informational)? How long does it take your IT Ops team to examine each such alert, understand what it means, acknowledge it and then archive or delete it? How many cumulative person-hours does your team spend on this task? What is the associated cost?
  • How often are outages reported by teams outside of IT? Why were those missed? What is the impact on the business?
  • What is the non-human cost of these outages (critical business systems that suffer downtime, revenue-generating systems that are not generating revenue, Point of Sale systems that are down, Payment Processing services that can’t process payments, SLA violations that result in penalties, etc.)?
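
The second question above reduces to simple arithmetic, sketched here in Python. The function name, its parameters and the sample figures are all hypothetical; substitute the numbers from your own environment.

```python
def annual_noise_cost(alerts_per_day: int, noise_share: float,
                      minutes_per_alert: float, loaded_hourly_cost: float,
                      days_per_year: int = 365) -> float:
    """Estimate the yearly staff cost of handling non-actionable alerts."""
    noisy_alerts = alerts_per_day * noise_share           # alerts that are just noise
    hours_per_day = noisy_alerts * minutes_per_alert / 60  # staff time spent on them
    return hours_per_day * loaded_hourly_cost * days_per_year


# Made-up inputs: 10,000 alerts a day, 90% noise, 30 seconds each,
# and a fully loaded cost of 60 per staff hour.
cost = annual_noise_cost(
    alerts_per_day=10_000,
    noise_share=0.9,
    minutes_per_alert=0.5,
    loaded_hourly_cost=60.0,
)
```

With these made-up inputs, operators spend about 75 hours a day on noise, which works out to roughly 1.64 million per year in staff cost alone, before any outage costs are counted.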

If you have a homegrown or custom Event Management or Event Correlation solution, consider quantifying these costs:

  • Engineering time and costs required to build or, if it’s already built, maintain this solution
  • Hardware costs to host the solution
  • Admin time required to administer and maintain the solution
  • Admin time required to customize the solution on an ongoing basis to keep it in sync with new tools and business requirements
  • Engineering time required to add new features and functionality periodically
  • Engineering time required to upgrade the solution to help it scale with growth

If you have an ineffective legacy/commercial Event Management solution, consider quantifying these costs as you build a business case for Event Correlation:

  • Annual software licensing costs, if applicable
  • Annual software support and maintenance costs
  • The cost of having an army of FTE admins who must maintain the solution
  • The cost of bringing in expensive third-party System Integrators or consultants to handle every new business requirement or tool integration

So, what are you waiting for?