Cambia automatically correlates alerts to identify actionable incidents for priority resolution.

“We’ve automated an average of 83% of alerts that come into BigPanda. Meaning the bulk of our alerts now get resolved automatically or receive a ticket without our team having to manually investigate it from beginning to end.”

Mark Peterson
IT Operations Supervisor, Cambia Health Solutions

  • 83% alert compression
  • 95% of NOC overall SLA met
  • 91% of NOC critical-alert SLA met

Cambia Health Solutions is dedicated to making the healthcare experience simpler, better, and more affordable for people and their families, including the more than 3.4 million people served through its regional health plans.

Challenge

  • Homegrown, legacy event management solutions didn’t correlate events or reduce alert noise.
  • Alerts could not be automatically enriched to provide the context necessary to triage and prioritize alerts, forcing the network operations center (NOC) to rely on tribal knowledge for incident response.
  • The inability to automate manual event management processes or incident triage workflows led to an average mean time to resolve (MTTR) of 30 minutes per incident, twice their committed 15-minute service level agreement (SLA).

As a healthcare provider, Cambia must maintain high uptime to meet the complex needs of its patients. “A huge focus for my team, being in the health industry, is recognizing the personal impact of these individual alerts. When an outage occurs, it affects people who are in need of medical care. When a server is down, it could mean delays in medical claims being processed, pre-approval for a procedure, and so on. So it is really important to us to be as efficient as possible with event management,” explains Mark Peterson, IT operations supervisor at Cambia Health Solutions.

Cambia’s NOC relied on a homegrown event management system for its IT operations. The system was not designed to support the long-term adoption of multicloud strategies or the rapid influx of disparate event and alert data from point monitoring solutions. It lacked the event deduplication, filtering, and correlation capabilities that would reduce event noise and help the NOC prioritize incidents, so incidents took on average twice as long to resolve as the SLA allowed.

Instead, the NOC team spent valuable time chasing hundreds of seemingly disconnected alerts without realizing they could all be part of a single incident. The increasing alert volume made it harder for the team to triage and resolve incidents before they escalated into outages, which detracted from its ability to deliver consistent uptime of online services to healthcare members.

“There was a lot of manual effort to our process and very low visibility into what was happening across our broader teams. Furthermore, we lacked context on how to categorize and prioritize alerts, forcing us to continue to rely on tribal knowledge to resolve an incident. This deficit really made its impact known during a 2am outage when we would have to wake people up for a bridge call to help us triage the incident,” says Peterson.

Solution

Cambia needed AIOps to reduce the volume of IT noise coming into a single pane of glass and automatically correlate those alerts into actionable incidents for priority resolution.

Cambia recognized the advantages of introducing BigPanda Incident Intelligence and Automation, powered by AIOps, to combat their fragmented and noisy alert environment. BigPanda ingests Cambia’s monitoring and change data, then normalizes, deduplicates, and filters the events into a consistent format. Alerts are enriched with technical details, such as host name and continuous integration and continuous delivery/deployment (CI/CD) information, to drive better correlation and noise reduction. Enrichment also helps identify actionable alerts that need immediate attention from the NOC. Through this process alone, Cambia’s NOC team is now able to adhere to the 15-minute service level agreement (SLA) to identify critical alerts and incidents, which previously took 30 minutes on average.
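To make the normalize, deduplicate, filter, and enrich steps concrete, here is a minimal Python sketch of that kind of pipeline. It does not use BigPanda's API or reflect its internals. The field names (`host`, `hostname`, `alert_name`, `severity`), the `host_tags` lookup table, and the sample payloads are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One monitoring event in a single, consistent shape."""
    host: str
    check: str
    status: str  # e.g. "critical", "warning", or "ok"


def normalize(raw: dict) -> Event:
    """Map differently named raw fields onto the common Event shape."""
    return Event(
        host=str(raw.get("host") or raw.get("hostname") or "unknown").lower(),
        check=str(raw.get("check") or raw.get("alert_name") or "unknown").lower(),
        status=str(raw.get("status") or raw.get("severity") or "unknown").lower(),
    )


def dedupe(events):
    """Keep only the first occurrence of each identical event."""
    seen = set()
    unique = []
    for event in events:
        if event not in seen:
            seen.add(event)
            unique.append(event)
    return unique


def filter_actionable(events):
    """Drop events that do not need attention (here: anything not critical/warning)."""
    return [e for e in events if e.status in {"critical", "warning"}]


def enrich(event: Event, host_tags: dict) -> dict:
    """Attach context tags for the event's host (hypothetical lookup table)."""
    return {
        "host": event.host,
        "check": event.check,
        "status": event.status,
        **host_tags.get(event.host, {}),
    }


def process(raw_events, host_tags):
    """Run the full pipeline: normalize, dedupe, filter, then enrich."""
    events = [normalize(raw) for raw in raw_events]
    events = filter_actionable(dedupe(events))
    return [enrich(e, host_tags) for e in events]


# Made-up sample data: two tools report the same problem with different field names.
raw = [
    {"hostname": "DB01", "alert_name": "CPU", "severity": "CRITICAL"},
    {"host": "db01", "check": "cpu", "status": "critical"},
    {"host": "web02", "check": "ping", "status": "ok"},
]
host_tags = {"db01": {"team": "database", "last_deploy": "build-123"}}

alerts = process(raw, host_tags)
print(alerts)
```

In this sketch the two reports about `db01` collapse into one alert once they share a format, the healthy `web02` event is filtered out, and the surviving alert carries the team and deployment tags that a responder would otherwise have to look up.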

“BigPanda has helped significantly with deduplicating, correlating, and automating our process. We have a better understanding of what is impacted throughout the organization and how to fix it quickly. This has been huge because it has given time and resources back to the NOC,” says Peterson. “The enrichment data that we process through BigPanda is allowing us to create more specific and insightful alert tags, which is helping us reduce the need for tribal knowledge. We get better context responses to alerts that are coming in and get the right teams involved for a far faster resolution time.”

Benefits

Cambia’s newfound visibility, gained through reduced alert noise and enhanced alert enrichment, allowed the NOC to automate its processes; critical alerts can now be identified within 30 seconds. BigPanda further enriches correlated groups of alerts, called incidents, with business context, like priority, team, and category. This allows Cambia to proactively create one ticket in their ITSM tool (instead of a ticket for each event) and assign it to the correct response team for incident resolution. This eliminates long and expensive bridge calls with all response teams at any time of day or night and expedites the proactive identification of, and response to, critical alerts.
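The idea of one ticket per incident, rather than one per event, can be illustrated with a short Python sketch. It is an assumption-laden toy, not BigPanda's correlation logic or any ITSM tool's API: alerts are grouped by a hypothetical `service` tag, priority is derived from the worst alert status, and the `owners` table and the `P1`/`P3` labels are invented for the example.

```python
from collections import defaultdict

# Hypothetical ranking used to find the most severe alert in a group.
SEVERITY_RANK = {"critical": 2, "warning": 1}


def correlate(alerts, key="service"):
    """Group alerts that share the same tag value into one incident."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert[key]].append(alert)
    return dict(groups)


def build_ticket(service, alerts, owners):
    """Create a single ticket for an incident, with business context attached."""
    worst = max(alerts, key=lambda a: SEVERITY_RANK.get(a["status"], 0))
    return {
        "service": service,
        "priority": "P1" if worst["status"] == "critical" else "P3",
        "team": owners.get(service, "noc"),  # fall back to the NOC if no owner is known
        "alert_count": len(alerts),
    }


# Made-up sample data: four alerts that belong to two incidents.
alerts = [
    {"host": "db01", "service": "claims", "status": "critical"},
    {"host": "db02", "service": "claims", "status": "warning"},
    {"host": "app07", "service": "claims", "status": "warning"},
    {"host": "web02", "service": "portal", "status": "warning"},
]
owners = {"claims": "database"}

tickets = [build_ticket(svc, group, owners) for svc, group in correlate(alerts).items()]
for ticket in tickets:
    print(ticket)
```

Four alerts produce two tickets here, and each ticket already names a priority and an owning team, which is the context that lets an incident be routed directly instead of being triaged on a bridge call.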

  • Automated 83% of alerts, enabling identification of critical alerts within 30 seconds
  • NOC overall SLA met: 95%
  • NOC critical-alert SLA met: 91%

“We’ve automated an average of 83% of alerts that come into BigPanda. Meaning the bulk of our alerts now get resolved automatically or receive a ticket without our team having to manually investigate it from beginning to end,” says Peterson.

“There’s a real personal impact that our NOC team can provide to Cambia’s individual members. When members are trying to get to our site or trying to call customer service with questions, we know they could be dealing with a rough time, medically, in their lives. The smoother we can make that whole process for them, by having our systems up and running properly, not having system delays, etc., it positively impacts efficiency when they need it most.”