We’ll get there… but until we do, the best way to make better ops decisions faster is to make data tell stories based on an awareness of infrastructure health. Data becomes information when we can associate meaning with patterns – that is, when strings of 0s and 1s indicate business impact.
For instance, data – CPU usage spiking on a node – might mean anything. It becomes information when you know that a spike two standard deviations above normal on the master Redis cluster means: swap the paging file, then restart Sentinel. That’s a pattern associated with action and impact. Anomalies remain mere anomalies until they are integrated with business logic and correlated with service health.
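To make the idea concrete, here’s a minimal sketch of the kind of rule described above – a simple z-score check that flags a reading more than two standard deviations from a recent baseline. The function name, the threshold, and the sample values are all illustrative assumptions, not BigPanda’s actual detection logic:

```python
from statistics import mean, stdev

def is_actionable_anomaly(samples, latest, threshold=2.0):
    """Flag a reading that deviates more than `threshold` standard
    deviations from the recent baseline (hypothetical rule)."""
    mu = mean(samples)
    sigma = stdev(samples)
    return sigma > 0 and abs(latest - mu) > threshold * sigma

# Illustrative baseline CPU usage (%) on a Redis master, then a spike:
baseline = [41, 44, 39, 42, 40, 43, 41, 38, 42, 40]
print(is_actionable_anomaly(baseline, 95))  # True
print(is_actionable_anomaly(baseline, 42))  # False
```

The point is not the statistics – it’s that the boolean on its own is still just data. It becomes information only when it is bound to a specific host, a specific service, and a specific runbook action.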
That’s what we’re launching today: BigPanda Service Health Analytics. It’s a new concept we pioneered that finally moves dashboards beyond eye candy and into territory where analytics drive better decisions. Service Health Analytics exposes all data from all monitoring sources in the form of configurable dashboards. It provides these pre-configured reports:
- Top alerting hosts
- Top alerting checks
- Correlation ratios from raw alerts to incidents
- Mean time to resolution for incidents
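The last report boils down to a simple aggregate: average the time between when an incident opens and when it resolves. A minimal sketch, assuming each incident carries opened/resolved timestamps (the data shape here is an assumption for illustration):

```python
from datetime import datetime, timedelta

def mean_time_to_resolution(incidents):
    """Average (resolved - opened) across a list of resolved incidents,
    each given as an (opened, resolved) datetime pair."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

# Two hypothetical incidents: resolved in 45 and 90 minutes.
incidents = [
    (datetime(2016, 5, 1, 9, 0),  datetime(2016, 5, 1, 9, 45)),
    (datetime(2016, 5, 1, 14, 0), datetime(2016, 5, 1, 15, 30)),
]
print(mean_time_to_resolution(incidents))  # 1:07:30
```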
All dashboards provide two global filters – Environments and time periods – and all customizations can be named and saved for quick access. Pre-configured and custom dashboards can be shared externally with team members via one-click Snapshots.
I’ve seen hundreds of dashboards from at least that many vendors. I’m proud to say Service Health Analytics is the most elegant, easy-to-use design for IT ops. Give it a try in your BigPanda org or sign up for a free trial and let me know what you think. There’s a feedback widget on the dashboard, and we review (and respond to!) every comment or suggestion.
Not surprisingly, we use BigPanda Service Health Analytics extensively to understand how the world manages IT infrastructure. Here are three examples of the insights it provides that help us run our business:
- The most popular monitoring stacks consist of six to eight tools (the modal value, observed in 72% of all IT orgs). Alongside specialized or custom tools, they include at least one from each of these categories: an open source systems management tool (Nagios, Zabbix, or Icinga), an APM tool (New Relic, AppDynamics, or Dynatrace), an error detection tool such as Sentry, and a log analytics tool (typically Splunk or, increasingly, Logstash).
- The noisiest tools in the stack – measured by aggregated ratios of raw alerts to correlated incidents – are Nagios, CloudWatch, and Splunk, with average compression ratios above 97%, versus 92% across all tools.
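For readers unfamiliar with the metric: the compression ratio is the share of raw alerts absorbed by correlation into incidents. A minimal sketch, with hypothetical alert counts chosen only for illustration:

```python
def compression_ratio(raw_alerts, incidents):
    """Fraction of raw alerts absorbed by correlating them into incidents."""
    return 1 - incidents / raw_alerts

# Hypothetical: 10,000 raw Nagios alerts correlated down to 300 incidents.
print(f"{compression_ratio(10_000, 300):.0%}")  # 97%
```

The higher the ratio, the noisier the tool relative to the actionable incidents it produces.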
- There’s an unexpected correlation between NOC team size and incident volume: adding NOC engineers is associated with higher incident volume. We still need to investigate that relationship (correlation is not causation!), but anecdotally, larger organizations have more complex infrastructure distributed across more datacenters, which generates more alerts… and they tend to substitute investment in people for investment in process.
DevOps will transition to NoOps when we all trust systems to self-diagnose and self-heal. With the rise of sensor networks, personal area clouds, and IoT, that future is closer than you’d think and yet it’s still three to five years out. Until then, we’ll rely on Service Health Analytics to distill data into information and information into better ops decisions made faster.