The underlying issue? One database instance somewhere in Salesforce’s massive data centers failed, causing file integrity issues that took the platform down for a large number of customers worldwide. It took Salesforce more than a full day of downtime to identify the root cause, and even then the issue could not be resolved. Salesforce continued to struggle in the face of public criticism on social channels, and finally gave up on fixing the issue and rolled back to a prior backup. After the rollback, customers were left with five hours of permanent data loss.
Fig 1: Salesforce system status of NA14 showing outage
What could Salesforce have done differently? Salesforce could have sped up incident detection and brought down the mean time to resolution with an upfront investment in an Alert Correlation Platform.
The Challenge: Today’s data centers have changed dramatically. The sheer scale, rampant fragmentation, and relentless pace of activity have increased by orders of magnitude over the last 10 to 15 years. According to the 2016 State of Monitoring survey, most organizations receive thousands of alerts every day from an average of six different monitoring tools. Each alert indicates a potential problem. Unfortunately, it can be difficult or impossible for IT professionals to make sense of all these alerts in time to identify the truly important incidents and resolve them quickly.
The Solution: Alert correlation is a method of grouping related alerts into high-level incidents, so IT professionals can easily and quickly identify urgent issues that require their attention.
Without alert correlation technology, it is difficult to understand the impact of an incident, let alone estimate its cost. It is also difficult to focus your troubleshooting efforts. In the case cited above, should you start by analyzing the database issues, the application issues, or the load issues on the servers? How can you tell whether multiple alerts are related to each other? Without alert correlation, important issues may fall through the cracks, leading to extended outages and undetected downtime that cost millions of dollars, frustrate customers, and damage reputations.
BigPanda automatically correlates IT alerts into high-level incidents to help you improve detection, accelerate remediation, and increase productivity. BigPanda does this by first aggregating and normalizing alerts generated by multiple monitoring systems. It then groups related alerts into high-level, actionable incidents based on specific systems, applications, topologies, and other variables.
- Improve Detection: Due to massive volumes of IT alerts, important incidents often fall through the cracks. When alerts are intelligently grouped and correlated, critical incidents are easier to find.
- Accelerate Remediation: Alert correlation gives you the full context of each incident instead of just one data point. Resolution times are faster because fewer alerts get lost in the stream of noise – so you can resolve incidents before they affect customers.
- Boost Productivity: Correlating alerts makes it easier to manage emergency situations by reducing the number of items you have to concentrate on. You can troubleshoot issues quickly, instead of engaging in repetitive manual correlation.
Fig 2: BigPanda snapshot of one incident correlated against 12 alerts
How Does BigPanda Work?
Out-of-the-box integrations make it easy to connect your monitoring systems to our cloud-based platform.
- Quick Connections: It only takes a few minutes to integrate your existing monitoring tools (such as Nagios, New Relic, AppDynamics, Splunk, Zabbix, CloudWatch, Pingdom, and more).
- Data Normalization: BigPanda analyzes data from all of your monitoring tools, then normalizes it all into a unified data model to capture and synthesize relevant information.
- Rapid Correlation and Customization: BigPanda groups alerts into actionable incidents without requiring you to define rules or build a dependency model. As a result, you’ll see correlated incidents within minutes of connecting BigPanda to your environment. You can always further customize BigPanda’s correlation rules to suit your specific systems, applications, topologies, and division of duties.
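To make the data normalization step more concrete, here is a minimal Python sketch of mapping alerts from two monitoring tools into one unified record. It is only an illustration: the `Alert` fields, the `STATUS_MAP` table, and both payload shapes (`tool_a`, `tool_b`) are hypothetical and do not reflect BigPanda's actual data model or any real tool's alert format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Unified alert record; the field names are illustrative assumptions."""
    source: str
    host: str
    check: str
    status: str      # one of "critical", "warning", "ok"
    timestamp: int   # seconds since the Unix epoch


# Hypothetical mapping from tool-specific state words to one vocabulary.
STATUS_MAP = {
    "CRITICAL": "critical",
    "ALARM": "critical",
    "WARNING": "warning",
    "OK": "ok",
}


def normalize_tool_a(raw: dict) -> Alert:
    """Normalize a made-up payload that reports time in seconds."""
    return Alert(
        source="tool_a",
        host=raw["host_name"],
        check=raw["service"],
        status=STATUS_MAP[raw["state"]],
        timestamp=int(raw["time"]),
    )


def normalize_tool_b(raw: dict) -> Alert:
    """Normalize a made-up payload that reports time in milliseconds."""
    return Alert(
        source="tool_b",
        host=raw["dimensions"]["instance"],
        check=raw["alarm_name"],
        status=STATUS_MAP[raw["state_value"]],
        timestamp=int(raw["ts_ms"]) // 1000,  # convert ms to whole seconds
    )


a = normalize_tool_a(
    {"host_name": "db-01", "service": "disk", "state": "CRITICAL", "time": 1462900000}
)
b = normalize_tool_b(
    {"dimensions": {"instance": "db-01"}, "alarm_name": "cpu",
     "state_value": "ALARM", "ts_ms": 1462900030500}
)
print(a)
print(b)
```

Once both alerts share the same field names, status vocabulary, and time unit, they can be compared and grouped regardless of which tool raised them.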
Case in Point
BigPanda alert correlation is valuable for resolving network issues, host-based issues, and application-level issues, along with hundreds of other use cases.
- Network Issues: There are many types of network outages, but they are all notorious for generating excessive noise. The crash of a single router can easily yield 500 alerts. BigPanda can correlate them all into a single incident.
- Host-based Issues: Tightly related alerts can represent dozens of metrics from a single host including low memory, page faults, memory utilization, and high CPU loads. BigPanda can group alerts based on standard or customized logic to streamline troubleshooting and reduce noise.
- Application-level Issues: There may be many hosts related to one application—such as a Java billing application that runs in a Tomcat application server spread among 50 hosts. BigPanda can group alerts related to the same application, no matter what host or what data center the alerts arise from.
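The grouping described in these cases can be sketched in a few lines of Python. This is a simplified illustration, not BigPanda's correlation engine: the `correlate` function, the alert dictionary fields, and the 300-second default window are all assumptions made for the example. An alert joins an open incident when it shares the same key (host, application, or anything else) and arrives within the window of the previous alert with that key.

```python
def correlate(alerts, key, window_seconds=300):
    """Group alerts into incidents.

    An alert is added to an existing incident if it has the same key value
    and arrives within window_seconds of that incident's latest alert.
    Otherwise it starts a new incident.
    """
    incidents = []       # list of incidents, each a list of alerts
    latest_by_key = {}   # key value -> most recent incident for that key
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        k = key(alert)
        current = latest_by_key.get(k)
        if current is not None and (
            alert["timestamp"] - current[-1]["timestamp"] <= window_seconds
        ):
            current.append(alert)
        else:
            current = [alert]
            latest_by_key[k] = current
            incidents.append(current)
    return incidents


# Hypothetical alerts; timestamps are in seconds.
alerts = [
    {"host": "app-01", "application": "billing", "check": "cpu", "timestamp": 0},
    {"host": "app-02", "application": "billing", "check": "memory", "timestamp": 60},
    {"host": "app-03", "application": "billing", "check": "latency", "timestamp": 120},
    {"host": "web-01", "application": "storefront", "check": "cpu", "timestamp": 90},
    {"host": "app-01", "application": "billing", "check": "cpu", "timestamp": 4000},
]

by_app = correlate(alerts, key=lambda a: a["application"])
by_host = correlate(alerts, key=lambda a: a["host"])
print(len(by_app), "incidents when grouped by application")
print(len(by_host), "incidents when grouped by host")
```

Grouping by application folds the three early billing alerts from different hosts into one incident, while grouping by host keeps them separate. Choosing the key is the customization step: the same alerts yield different incidents depending on whether you correlate by host, application, or another attribute.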
Interested in learning more about what IT alert overload can cost your organization? Download our white paper, The Three Hidden Costs of IT Alert Overload. Ready to get started with up to 99% alert correlation? Start a free trial with BigPanda.