Build a boat, not a house: How event correlation strengthens your observability strategy

Date: September 8, 2021

Are you a modernizing enterprise IT shop? If so, which of these statements best describes your observability situation?

  1. We’re evolving or rebuilding our monitoring / observability stack, but it currently has gaps. We need to find the right tools to plug those visibility holes.
  2. Our monitoring / observability stack is complicated and messy. We’ve accumulated tools and technical debt, and we have some visibility redundancies and gaps we need to figure out.

If either of these statements describes your reality, you’re far from alone: we’ve talked to hundreds of companies in similar circumstances. In both cases, there may be a strong inclination to “get our monitoring house in order” before optimizing things further along the IT service management chain … things like event correlation, ticketing, or automations to shorten the incident lifecycle.

It sounds reasonable, right? If the monitoring and observability layer is the foundation of your incident management tech stack, and the collaboration tooling layer is the house your people live in every day, doesn’t it make sense to build a solid foundation first, before putting up the house?

Here’s the thing, though. You’re not building a house. You’re building a boat. 

Why? Because the infrastructure on which your business depends is constantly moving and changing, like the sea. 

Unfortunately, since the conditions underneath you can change at a moment’s notice, your boat (and the people within it) has to be ready for anything. Those dynamic conditions could be coming from changes to monitored services, new observability tools being added, existing ones being retired, new BUs or acquisitions onboarding their systems, or rapid changes in the data inundating your team.

So if you accept that your monitoring and observability approach needs to be fluid and dynamic, how do you build the best boat you can? A modern event correlation solution powered by AIOps can help with that.

 

Building back better

Let’s consider an IT Ops Maturity Model focused on monitoring and event processing, and revisit our original question. Did you choose option 1? If so, you’re probably somewhere around Phase 1 in this model: 

With this model, you can see where your organization’s monitoring strategy lies today, and how to move your organization towards a more mature, autonomous observability posture. Organizations in Phase 1 are often working towards building out or modernizing their monitoring stack. Depending on other priorities, they may or may not be making progress. Meanwhile, visibility gaps persist that rob the team of the contextual data they need to quickly resolve incidents, further complicating their priorities.

There are other risks that go along with continuing to chase the “optimal” monitoring tool stack. Every time a new tool is evaluated or deployed, the incident data being fed to your response team changes, and the volume of incoming data goes up. This causes confusion, retraining, and capacity issues for the team. If a bad call was made, and a tool needs to be pulled back or replaced (hey, it happens), the ripple effects on your team could cause further problems. And as business requirements change and new services arrive to be supported, an observability environment that was approaching stability could be thrown back into chaos.

A modern, AIOps-driven event correlation solution, dropped in between your monitoring tools and your ticketing/collaboration stack, can be immensely helpful. It will immediately eliminate noise from your observability data sources and create actionable incidents instead, dramatically improving your IT Ops team’s bandwidth to handle issues as they arise.

But perhaps more importantly, a good event correlation solution can become a foundational piece of your team’s incident management workflow, providing a stable source of truth for the incidents that actually warrant their attention. 

Even if your organization standardizes around an existing system-of-record (such as a ticketing system like ServiceNow), its users will find that tickets become less frequent, and the ones that are created become much more useful, once incoming alerts are filtered by an AIOps event correlation layer. This is because noise is reduced, related alerts are bundled together, and records can be annotated with relevant data to accelerate incident response (such as recent changes, affected services, or deduced priority levels).
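To make those three steps concrete, here is a minimal sketch of deduplicating, bundling, and annotating alerts. It is an illustration under stated assumptions, not how any particular product works: the Alert and Incident shapes, the correlate function, the five-minute window, the "tool_a"/"tool_b" source labels, and the recent_changes lookup are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    source: str       # monitoring tool that raised the alert (hypothetical label)
    host: str
    check: str
    service: str
    timestamp: float  # seconds


@dataclass
class Incident:
    service: str
    last_seen: float
    alerts: list = field(default_factory=list)  # one alert per (host, check)
    raw_count: int = 0                          # every alert folded in, duplicates included
    tags: dict = field(default_factory=dict)    # enrichment, e.g. a recent change


def correlate(alerts, window=300.0, recent_changes=None):
    """Bundle alerts for the same service that arrive within `window` seconds."""
    recent_changes = recent_changes or {}
    incidents = []
    open_incident = {}  # service -> most recent incident for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = open_incident.get(alert.service)
        if incident is None or alert.timestamp - incident.last_seen > window:
            # No incident yet, or the last one went quiet: start a new one.
            incident = Incident(service=alert.service, last_seen=alert.timestamp)
            if alert.service in recent_changes:
                incident.tags["recent_change"] = recent_changes[alert.service]
            incidents.append(incident)
            open_incident[alert.service] = incident
        incident.raw_count += 1
        incident.last_seen = alert.timestamp
        # Noise reduction: keep a repeated (host, check) pair only once.
        if not any((a.host, a.check) == (alert.host, alert.check) for a in incident.alerts):
            incident.alerts.append(alert)
    return incidents


sample = [
    Alert("tool_a", "web-1", "cpu", "checkout", 0),
    Alert("tool_a", "web-1", "cpu", "checkout", 60),      # duplicate of the first
    Alert("tool_b", "web-2", "latency", "checkout", 90),  # same service, other tool
    Alert("tool_a", "db-1", "disk", "billing", 100),
]
incidents = correlate(sample, recent_changes={"checkout": "deploy at 14:02"})
```

In this sample, four raw alerts collapse into two incidents, and the checkout incident carries the recent change as context. A real correlation layer would use far richer signals (topology, learned patterns) than a fixed per-service time window.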

This means that monitoring and observability data sources can be added, removed, or swapped out like Lego bricks, with no negative impact on how IT operations detects or manages new incidents. That goes for new services that must be supported by IT operations as well; newly created applications and infrastructure feed data to IT operations using the very same AIOps capabilities, which correlate the new incoming alerts with the intelligence and efficiency they already provide for existing services.
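One way to picture the Lego-brick idea is a thin adapter per monitoring tool that maps its payload into a single common event shape, so everything downstream sees only that shape. The sketch below is an assumption-laden illustration: the tool names, payload fields, and the NORMALIZERS registry are invented for this example and do not reflect any real tool’s format.

```python
NORMALIZERS = {}  # source name -> function that maps a raw payload to a common event


def normalizer(source):
    """Decorator that plugs an adapter in under the given source name."""
    def register(fn):
        NORMALIZERS[source] = fn
        return fn
    return register


@normalizer("tool_a")  # hypothetical tool with flat, upper-case state strings
def from_tool_a(payload):
    return {
        "host": payload["hostname"],
        "check": payload["service_desc"],
        "status": payload["state"].lower(),
    }


@normalizer("tool_b")  # hypothetical tool with nested resources and numeric severity
def from_tool_b(payload):
    return {
        "host": payload["resource"]["name"],
        "check": payload["metric"],
        "status": "critical" if payload["severity"] >= 3 else "warning",
    }


def ingest(source, payload):
    """Everything downstream of this call sees only the common event shape."""
    event = NORMALIZERS[source](payload)
    event["source"] = source
    return event


event_a = ingest("tool_a", {"hostname": "web-1", "service_desc": "cpu", "state": "CRITICAL"})
event_b = ingest("tool_b", {"resource": {"name": "web-2"}, "metric": "latency", "severity": 2})
```

Adding a tool means registering one more adapter, and retiring one means removing its entry from the registry. The correlation and ticketing logic that consumes the common shape does not change either way.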

The upshot of all of this is that changes to data sources, monitoring tools, or managed services effectively become frictionless events for IT operations, and a huge portion of the risk normally attached to those types of changes is eliminated. In other words, your boat becomes much more seaworthy no matter what is happening underneath the hull.

 

Rationalizing tools

On the other hand, maybe your IT organization has been around the block a few times. Perhaps you have a legacy stack of services to support, and a modern collection of cloud-based applications to bring online. Perhaps you’re inheriting new infrastructure through acquisitions that occurred above your pay grade. In any case, you chose option 2 above, because you have what could affectionately be called a bird’s nest of monitoring tools, along with all the chaotic, overlapping visibility problems that came with them.

Here’s that maturity model again; this time you’re probably somewhere around Phase 2:

You might be tempted to clean out your bird’s nest before adding an AIOps solution upstream. After all, garbage in equals garbage out, right? But actually, it turns out that bringing a modern AIOps-powered event correlation solution into the mix early can still offer tangible benefits.

For starters, AIOps-powered event correlation will offer significant noise reduction benefits that will return bandwidth to your team almost immediately.

They can use this newfound bandwidth to understand where the noisiest, dirtiest data sources are and correspondingly focus their efforts. They can also get insights on where their visibility gaps are, and research solutions to fill them. With every optimization, the team enjoys growing alert compression, improving visibility, and more actionable incidents.

Additionally, since event correlators roll up multitudes of alerts into individual incidents, they can provide a unique holistic view of your entire monitoring and incident management pipeline. If they provide the right dashboard or analytics views, your IT Ops team can now easily see which monitoring tools are providing value … and which ones are redundant. This can be valuable information when budget season rolls around (basically, every season) and you’re trying to reduce the amount of deadweight your boat carries around.
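As a rough illustration of what such an analytics view could compute, the sketch below takes correlated incidents and reports overall alert compression plus, for each tool, how many incidents it contributed to and how many it alone detected. The input format, the tool_report function, and the "zero sole-source incidents" heuristic are assumptions made for this example; a tool flagged this way is only a candidate for review, not proof of redundancy.

```python
from collections import Counter


def tool_report(incidents):
    """`incidents` is a list of incidents, each a list of source names (one per alert)."""
    contributed = Counter()  # incidents each tool took part in
    sole_source = Counter()  # incidents only that tool detected
    total_alerts = 0
    for sources in incidents:
        total_alerts += len(sources)
        distinct = set(sources)
        contributed.update(distinct)
        if len(distinct) == 1:
            sole_source.update(distinct)
    # Share of raw alerts that were folded away by correlation.
    compression = 1 - len(incidents) / total_alerts if total_alerts else 0.0
    # Tools that never detected anything on their own deserve a closer look.
    candidates = sorted(t for t in contributed if sole_source[t] == 0)
    return {
        "compression": compression,
        "contributed": dict(contributed),
        "sole_source": dict(sole_source),
        "redundancy_candidates": candidates,
    }


report = tool_report([
    ["tool_a", "tool_a", "tool_b"],
    ["tool_a"],
    ["tool_c", "tool_c"],
    ["tool_a", "tool_b"],
])
```

Here eight alerts became four incidents (50% compression), and tool_b only ever fired alongside tool_a, which makes it the one to examine before budget season.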

 

The need for speed

One common objection relates to time. AIOps projects have historically gotten a bad rap for taking a long time (often, years) to return any value, and they have often fallen far short of the lofty goals initially set by the organization and the AIOps vendor. Knowing this, you may balk at delaying your efforts to modernize your observability stack in order to install an AIOps event correlation solution first.

Things today are very different, and the good news is that solutions with rapid time to value do exist. 

You can choose a vendor that understands your IT Ops use cases, has proven technology to allow go-lives in just a few weeks, and can promise attainable outcomes and PROVE to you that they have a solid basis for delivering them.

BigPanda has been helping option 1 and option 2 organizations – including some of the largest ones in the world – build better boats for years, with an AIOps-powered event correlation and automation solution that supports any evolving observability strategy. Most of our deployments complete in under three months – we invite you to reach out and find out why.