In the past decade, ServiceNow revolutionized IT service assurance. The SaaS platform brought order to the chaotic process of incident management, helping thousands of companies increase their uptime while improving efficiency.
Recent shifts in the architecture of production environments have introduced new challenges. More and more companies are finding that relying on ServiceNow in isolation is not enough to maintain high SLAs in cost-effective ways.
What are some common problems experienced by companies?
- NOC Hyper-growth – the NOC headcount is growing faster than the business
- Late Incident Detection – critical issues are discovered only after customers are affected
- Broken Collaboration – remediation processes are hindered by organizational finger-pointing
Fortunately, none of these issues is beyond repair. By leveraging a data science approach to service assurance, you can eliminate these problems, while significantly boosting the value you get from using ServiceNow.
The Data Explosion
If you have talked to a NOC operator or a level-1 engineer recently, then you have surely noticed the sweat on their foreheads and the gloom in their eyes. Operators of modern production environments are expected to handle more IT events than is realistically feasible.
The current situation stems from two major trends in the architecture of modern data centers. First, virtualization, clouds, and containers have all enabled companies to run huge server fleets to support their applications. Second, adoption of service-oriented software topology has resulted in highly fragmented applications. This progress came at a cost: an unprecedented explosion in the amount of operational data generated.
Diving deeper, it is easy to see how this data explosion impedes IT remediation processes. The traditional process begins with monitoring software detecting misbehaving metrics and alerting level-1 responders. The responders prioritize the issue, then perform initial investigation, follow through a runbook, and, when necessary, route the incident to level-2 and level-3 engineers.
However, when alerts occur tens or even hundreds of times per hour, the assumptions behind this process collapse. It is impossible to identify and prioritize events within an alert flood without critical issues falling through the cracks. Beyond that, initial investigation of issues consumes time, forcing companies to hire more and more level-1 responders to work in parallel.
The truth is that tools such as ServiceNow were never designed to handle such a high volume of machine-generated incidents. A complementary solution is needed.
Alert correlation offers a pragmatic solution to the problem of high alert volume. With very little configuration, it can cut down the number of monitoring alerts by more than 95%.
This is how companies use alert correlation with a tool such as BigPanda:
- All your monitoring tools are plugged into BigPanda
- BigPanda consumes and normalizes thousands of alerts every minute
- Related alerts are correlated into high-level incidents
- BigPanda populates incidents into ServiceNow
- ServiceNow incidents are continuously updated with new information and insights
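The normalization step in the flow above can be pictured as mapping each tool's payload onto one common alert schema. The following is a minimal sketch only: the `Alert` fields, the per-tool field names in `FIELD_MAPS`, and the status values in `STATUS_MAP` are illustrative assumptions, not the actual payload formats of Nagios or Zabbix and not BigPanda's internal schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Common schema shared by all sources (hypothetical)."""
    source: str     # monitoring tool that raised the alert
    host: str
    check: str
    status: str     # "critical", "warning" or "ok"
    timestamp: int  # Unix seconds


# Assumed field names per tool; real payloads may differ.
FIELD_MAPS = {
    "nagios": {"host": "host_name", "check": "service_description",
               "status": "state", "timestamp": "last_check"},
    "zabbix": {"host": "hostname", "check": "trigger_name",
               "status": "severity", "timestamp": "clock"},
}

# Assumed mapping from tool-specific status words to common ones.
STATUS_MAP = {
    "CRITICAL": "critical", "WARNING": "warning", "OK": "ok",
    "High": "critical", "Average": "warning",
}


def normalize(source: str, payload: dict) -> Alert:
    """Translate one tool-specific payload into the common Alert schema."""
    fields = FIELD_MAPS[source]
    raw_status = str(payload[fields["status"]])
    return Alert(
        source=source,
        host=str(payload[fields["host"]]).lower(),
        check=str(payload[fields["check"]]),
        # Unknown status words fall back to "warning" in this sketch.
        status=STATUS_MAP.get(raw_status, "warning"),
        timestamp=int(payload[fields["timestamp"]]),
    )
```

Once every alert has the same shape, the correlation step can compare alerts from different tools without caring where they came from.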
You can use alert correlation to consolidate data from tools such as Nagios, SolarWinds, AppDynamics, Zabbix and many others. Additionally, you can use an API to push your custom events into the correlation tool.
To better understand the mechanics of alert correlation, consider the following example. At 9:15 am an engineer deploys a new version of the billing application. As is often the case, the new code contains a particularly heavy database query.
At 9:22, a database cluster begins to experience heavy load due to the problematic query. The NOC starts receiving CPU, memory and disk load alerts for the cluster. All of the cluster’s 25 servers are alerting. At 9:27, the billing application triggers a transaction-latency alert. Soon the message queues fill up, resulting in even more alerts.
Between 9:22 and 9:45 the NOC receives north of 100 alerts, all tied to just one issue.
With alert correlation, this entire alert storm would have been transformed into fewer than five incidents! Similar alerts, coming from the same cluster and occurring in a short time frame, are grouped together automatically.
With fewer than five incidents, the NOC operator can breathe a sigh of relief and concentrate on remediation. Resolution time goes down while cost-efficiency goes up.
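The grouping rule in this example (same cluster, short time frame) can be sketched in a few lines. This is an illustration under stated assumptions, not BigPanda's actual algorithm: the 15-minute window, the cluster and host names, and the number of message-queue alerts are all made up to reproduce the scenario above.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    cluster: str
    host: str
    check: str
    minute: int  # minutes after 9:00 am


@dataclass
class Incident:
    cluster: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> int:
        return self.alerts[-1].minute


def correlate(alerts, window_minutes=15):
    """Group alerts from the same cluster that arrive within
    window_minutes of that cluster's most recent alert."""
    latest = {}     # cluster -> most recent incident for that cluster
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.minute):
        incident = latest.get(alert.cluster)
        if incident is None or alert.minute - incident.last_seen > window_minutes:
            # No incident yet, or the last one went quiet: open a new one.
            incident = Incident(cluster=alert.cluster)
            latest[alert.cluster] = incident
            incidents.append(incident)
        incident.alerts.append(alert)
    return incidents


# 9:22: 25 database servers each raise CPU, memory and disk alerts.
storm = [Alert("db-cluster", f"db-{n:02d}", check, 22)
         for n in range(1, 26) for check in ("cpu", "memory", "disk")]
# 9:27: the billing application raises a transaction-latency alert.
storm.append(Alert("billing-app", "billing-01", "transaction-latency", 27))
# 9:30 to 9:44: message queues fill up (alert count is an assumption).
storm += [Alert("message-queue", f"mq-{n:02d}", "queue-depth", 30 + n % 15)
          for n in range(1, 26)]

incidents = correlate(storm)
print(f"{len(storm)} alerts -> {len(incidents)} incidents")
```

With these made-up inputs, 101 alerts collapse into 3 incidents, one per affected cluster. A production correlation engine would use richer signals than a single cluster label and a fixed window, but the principle is the same.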
BigPanda ❤ ServiceNow
BigPanda is the leading alert correlation engine on the market. It is the only tool to provide up to 99% alert compression without loss of accuracy.
In addition to alert correlation, BigPanda also provides powerful data analytics capabilities, helping you identify important incident trends, measure MTTR, and more.
BigPanda partnered with ServiceNow to provide seamless integration between the two systems. Setting up BigPanda takes just a few minutes. You can find the BigPanda-ServiceNow Connector through the ServiceNow Store.