Automating Incident Management

Now, that’s not exactly news.  Many areas of data center management have already embraced automation, enabling IT teams to significantly scale their capabilities.  A few examples:

  • Configuration management (Chef, Puppet, Ansible)
  • Log management (Splunk, Sumologic)
  • Security event management (Arcsight)

Sadly, IT Incident Management is one aspect of data center management that’s remained behind the curve.  As data centers have exploded in size and complexity, so have the number of daily IT alerts that require attention.  Yet the way in which IT teams manage and respond to those alerts has not kept up.  Alert management is still highly manual, slow and repetitive work.  We founded BigPanda to fix this problem.

A good analogy for the current state of Incident Management is Configuration Management.  Configuring servers by hand was possible when you had a handful of on-premise machines.  Not so, when the number of machines requiring management has exploded.  It’s no longer feasible to configure and manage every server by hand.  Too slow.  Doesn’t scale.  That problem spawned a new generation of CM tools (CFEngine, Chef, Puppet, etc.) whose central tenet was: automate in order to scale.

The problem facing Incident Management is similar.  Manually managing and responding to IT alerts by hand made sense once.  But things have changed.  IT teams are facing new challenges:

  • Scale: data centers growth is accelerating and so is the scale of daily IT alerts that require attention.
  • Fragmentation: Companies have moved away from monolithic monitoring solutions like HP or IBM, towards using multiple tools such as Splunk, New Relic, Nagios, Zabbix and Pingdom. When IT issues occur, correlating alerts and connecting the dots between all those tools is a time-consuming and error-prone task.

Bottom line: the scale and fragmentation of IT alerts have increased dramatically.  Unfortunately, the way teams respond to those alerts has not kept up.  Incident Management still consists of highly manual processes like:

  • Constantly filtering through noisy alerts to spot critical issues
  • Trying to connect the dots between alerts to understand larger trends
  • Prioritizing issues and escalating them to the right people
  • Correlating between alerts and recent changes to spot root cause
  • Manually creating, updating and managing tickets.

Wow!  That’s a lot of slow and non-scalable manual labor.

Incident Management today has hit a bottleneck and is ripe for automation.  It needs new tools and new approaches along the lines of what Chef and Puppet did for Configuration Management: enabling IT to scale their capabilities by automating away slow and repetitive manual tasks. For Configuration Management, automation came in the form of repeatable recipes.  For Incident Management, automation will come in the form of data science.  We believe that only through leveraging data science can IT teams tackle the scale of machines, events and dependencies that must be understood and managed. That’s why we founded BigPanda.

HomePage img1 FullScreenBigPanda650x435