Quick reads from the Value and Adoption team: The Importance of Self-Healing (and How It Works)

3 min read

Most folks familiar with BigPanda know that automation is a foundational block of our technology. Our platform automates the entire events pipeline: standardizing and deduplicating alerts, cutting down on the volume of incidents, and enriching alerts automatically to provide better context and alert payloads. But these are all part of an inbound flow of events through integrations. Our automation technology includes outbound integrations as well, which most companies use for ticketing purposes. A less-well-known use case for outbound integrations is kicking off downstream automations.

One example of an outbound use case is an environment that communicates with a self-healing automation platform. In this environment, no human operator is needed: it receives the incident and automatically shares it to a downstream system such as a self-healing automation platform. This functionality represents L0 automation. L1 is your first tier of resources that receive your incidents, but L0 is the self-healing level for incidents that require no human operator's touch.

How Does Self-Healing Work?

BigPanda is still receiving events—we’re still sending them through maintenance APIs, enriching them with tags, etc. And we’re still suggesting root cause changes. However, we’re also taking a purposeful look at the types of incidents that are generated. (We usually talk to operators to understand what these incidents are and ask them to identify a subset of incidents they receive on a regular basis that might be candidates for self-healing automation.) So everything still flows into BigPanda in the same way, but we’re creating an environment that’s designed to share out to a self-healing platform like Rundeck, Ansible, Chef, or Ayehu.

We then include a job ID—either through alert enrichment or incident tagging—to communicate with the self-healing platform, so when we share outbound to one of these self-healing platforms, the incident payload contains that job ID. That system now understands what job it’s about to run.
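To make the job ID handoff concrete, here is a minimal sketch of what the receiving side might do with a shared incident. It is not BigPanda's or any automation platform's actual schema: the tag name `remediation_job_id`, the payload fields (`incident_tags`, `alerts`, `tags`), and both function names are assumptions chosen for illustration.

```python
from typing import Optional

# Hypothetical tag name; the real one is whatever your enrichment
# or incident-tagging configuration defines.
JOB_ID_TAG = "remediation_job_id"


def extract_job_id(incident: dict) -> Optional[str]:
    """Return the job ID carried by the incident, or None if there is none.

    Assumed payload shape (illustrative only):
      {"id": "...",
       "incident_tags": [{"name": "...", "value": "..."}],
       "alerts": [{"tags": {"...": "..."}}]}
    """
    # Incident-level tags take precedence over alert-level enrichment.
    for tag in incident.get("incident_tags", []):
        if tag.get("name") == JOB_ID_TAG and tag.get("value"):
            return str(tag["value"])
    for alert in incident.get("alerts", []):
        value = alert.get("tags", {}).get(JOB_ID_TAG)
        if value:
            return str(value)
    return None


def build_job_request(incident: dict) -> Optional[dict]:
    """Build a request for the automation platform, or None if no job applies."""
    job_id = extract_job_id(incident)
    if job_id is None:
        return None  # Not a self-healing candidate; leave it for operators.
    return {"job_id": job_id, "incident_id": incident.get("id")}
```

Carrying the incident ID alongside the job ID is a design choice in this sketch: it gives the job what it needs to report its result back against the right incident later.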

The last step is making updates to the jobs that are configured in the self-healing automation platform. We add a component to each job so that it reports back to BigPanda whether the job succeeded or failed. If it's successful, it leaves a comment on the incident in BigPanda noting that the system attempted automated self-healing and that the result was a success. With that success, monitoring is now able to clear the alert, and the incident automatically closes in BigPanda once all alerts contained in that incident are resolved.

If the job is not successful, the comment still comes through, enabling an operator to see that the incident was designated for self-healing but that the attempt failed. That becomes additional context the operator can use when taking further steps to resolve the incident.
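The report-back step for both outcomes can be sketched as follows. This is an illustration under stated assumptions, not a real integration: `JobResult`, `build_comment`, and `incident_can_close` are hypothetical names, the comment wording and the `"resolved"` status string are made up, and actually posting the comment to BigPanda is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobResult:
    """Outcome of one self-healing job run (hypothetical structure)."""
    incident_id: str
    job_id: str
    succeeded: bool
    detail: str = ""


def build_comment(result: JobResult) -> str:
    """Compose the comment the job would leave on the incident."""
    outcome = "succeeded" if result.succeeded else "failed"
    comment = f"Automated self-healing attempted (job {result.job_id}): {outcome}."
    if result.detail:
        comment += f" Detail: {result.detail}"
    if not result.succeeded:
        # On failure, the comment is the context an operator picks up from.
        comment += " Operator follow-up required."
    return comment


def incident_can_close(alert_statuses: List[str]) -> bool:
    """An incident closes only when every alert it contains is resolved."""
    return bool(alert_statuses) and all(s == "resolved" for s in alert_statuses)
```

Note that the job itself never closes the incident in this sketch. It only comments; closure follows from monitoring clearing the alerts, as described above.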

This approach lets an organization keep identifying which incidents are candidates for L0 automation, reducing the load on human operators and creating efficiencies throughout all levels of IT Ops.

Interested in how to start identifying candidate incidents for L0 automation in your business? Reach out to your CSM today to set up a meeting with the Value and Adoption team at BigPanda. We’d be happy to walk you through the first steps.