Stop Managing Ops Incidents with Jira or Zendesk
For example, consider the process of initiating an issue or a ticket. Unlike most engineering tasks, the vast majority of production incidents are triggered automatically. Companies depend on tools such as Nagios, New Relic, Splunk and Pingdom to detect problematic symptoms. These symptoms are then analyzed, correlated and grouped into incidents. Incidents are dynamic, short-lived and occur dozens or hundreds of times a day. It is easy to see why a manual process is too slow and error-prone.
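To make the "grouped into incidents" step concrete, here is a minimal sketch in Python. It assumes a deliberately simple rule: alerts for the same service that arrive within five minutes of each other belong to one incident. The `Alert` and `Incident` types, the `service` grouping key and the five-minute window are all hypothetical choices for illustration, not how any particular monitoring tool or platform works.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    source: str          # monitoring tool that raised the alert, e.g. "nagios"
    service: str         # hypothetical grouping key
    timestamp: datetime


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self):
        # Timestamp of the most recent alert attached to this incident.
        return self.alerts[-1].timestamp


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts for the same service that arrive within `window` of each other."""
    latest = {}      # service -> most recent incident for that service
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        # Start a new incident if none exists or the previous one has gone quiet.
        if current is None or alert.timestamp - current.last_seen > window:
            current = Incident(service=alert.service)
            incidents.append(current)
            latest[alert.service] = current
        current.alerts.append(alert)
    return incidents
```

Feeding this function a day's worth of alerts would collapse bursts of related symptoms into a much shorter list of incidents. A real system would likely correlate on more than a single key and a fixed time window.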
Another good example is operational insight. To correctly prioritize and route an issue, some context is required. What deployments and config changes occurred around the same time? When did the same issue occur in the past, and how often? Were critical applications or websites impacted? These questions, sadly, cannot be answered by a generic issue tracker.
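Two of these questions can be sketched in a few lines of Python, assuming the change log and the incident history are already available as plain lists of tuples. The function names, the tuple layouts and the default time windows below are hypothetical, chosen only to illustrate the idea.

```python
from datetime import datetime, timedelta


def changes_near(incident_start, changes,
                 before=timedelta(minutes=30), after=timedelta(minutes=5)):
    """Return (timestamp, description) changes around the incident, newest first."""
    lo, hi = incident_start - before, incident_start + after
    nearby = [c for c in changes if lo <= c[0] <= hi]
    return sorted(nearby, key=lambda c: c[0], reverse=True)


def recurrence(incident_key, history):
    """Count past occurrences of `incident_key` in (key, timestamp) history."""
    past = sorted(ts for key, ts in history if key == incident_key)
    return {"count": len(past), "last_seen": past[-1] if past else None}
```

Given an incident start time, `changes_near` surfaces the deployments and config changes most likely to be related, and `recurrence` reports how often the same issue has been seen before. The hard part in practice is collecting that data automatically from the deployment and monitoring tools, which is exactly what a generic issue tracker does not do.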
A good incident management platform integrates intelligently with the company’s monitoring stack, providing real-time context for every incident. It consumes and leverages information residing in various infrastructure data hubs (e.g., a configuration management system such as Puppet or Chef). It delivers automation where appropriate, replacing critical yet inefficient manual processes.
It is not unusual to see companies implement their own incident management tools. Some start from scratch; others adapt existing issue trackers, augmenting them with scripts and hacking their code. In the next few years, I expect to see more and more companies migrate to out-of-the-box solutions, specifically tailored to cope with the complexities of incident management.