The initial setup is usually small: you're the only person receiving the alerts, and every alert counts. Life is a dream. Nagios does a good job of monitoring your services. But the honeymoon period doesn't last long: things quickly move from great to okay to unmanageable. Before you know it, tens or even hundreds of email alerts are pouring in every day. You try to make sense of this endless tidal wave, but it's an uphill battle. Of course, alerts don't have to be like that. Here are a few aspects of effective alert management, and why email is a poor tool for each of them:
- Dynamic Content, Not Static – Emails cannot change after they are received; alerts change all the time. This means that you get a separate email for every status change, which makes it very hard to understand what happened and what the current status is. Looking at the last email is a reasonable workaround, but it won't be easy if you have to sift through tens of emails from last night alone.
- Application and Cluster-Level Alerts – Nagios alerts are based on services and hosts, which means that if a server has multiple problems, a dedicated email will be sent for each problem. You might try to solve this by defining dependencies, but in modern environments, the interesting entities are Applications and their Clusters, not just specific servers. For example, in a cluster of a hundred servers, if only one has a problem, chances are that everything is still working as expected – no reason to work all night because of it. If fifty are down, it is critical to be alerted, but getting fifty alerts would be unmanageable. It would be ideal to receive only one alert about the problematic cluster, how many servers are affected, and how many are still live.
- Context – Usually, the information that comes with an alert isn't enough to fully understand the problem, let alone solve it. Getting additional information on the problem at hand can greatly reduce the time to resolution. For example, if a server is under heavy load, it can be very helpful to see the CPU graph of the last hour and the output of a top command. Part of this can be achieved with a notification command in Nagios, but this is only the tip of the iceberg. What if you could see a histogram of the latest occurrences of this problem, or a list of all the changes that occurred in your system at around the same time?
- Actionable – Getting only the content isn't enough. As a start, after receiving an alert, I want to mark that I'm working on it or, if the alert should be handled by someone else, assign it to them right from the alert. Taking this to the next level, there are plenty of other possible actions, such as resolving manually, snoozing, sharing, and changing priority.
- Team Collaboration – Working in a team makes everything better, but handling Nagios alerts in a team is a real pain. Looking at the email, how can you know if someone has already figured out the problem? How do you share information on an open alert? Or how do you see the notes from the last time the alert happened? All of these relatively obvious options are simply not possible with email.
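To make the cluster-level idea more concrete, here is a minimal Python sketch of how a stream of per-host status changes could be collapsed into a single summary per cluster. It is not how Nagios or any particular product does it; the `HostAlert` type, the `summarize_clusters` function, and the 50% threshold for calling a cluster critical are all assumptions made up for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class HostAlert:
    """One status-change notification (hypothetical shape, for illustration)."""
    cluster: str
    host: str
    status: str  # "OK" or "CRITICAL"


def summarize_clusters(alerts, critical_ratio=0.5):
    """Collapse per-host status changes into one summary per cluster.

    critical_ratio is an assumed threshold: a cluster counts as critical
    when at least this fraction of its known hosts is down.
    """
    # Keep only the most recent status per host, so a host that went
    # CRITICAL and then recovered is counted as live, not as two events.
    latest = {}
    for alert in alerts:
        latest[(alert.cluster, alert.host)] = alert.status

    counts = defaultdict(lambda: {"down": 0, "live": 0})
    for (cluster, _host), status in latest.items():
        counts[cluster]["live" if status == "OK" else "down"] += 1

    summaries = {}
    for cluster, c in counts.items():
        total = c["down"] + c["live"]
        summaries[cluster] = {
            "down": c["down"],
            "live": c["live"],
            "critical": c["down"] / total >= critical_ratio,
        }
    return summaries


if __name__ == "__main__":
    stream = [
        HostAlert("web", "web1", "CRITICAL"),
        HostAlert("web", "web1", "OK"),  # web1 recovered
        HostAlert("web", "web2", "CRITICAL"),
        HostAlert("web", "web3", "OK"),
        HostAlert("web", "web4", "OK"),
        HostAlert("db", "db1", "CRITICAL"),
        HostAlert("db", "db2", "CRITICAL"),
    ]
    for name, summary in summarize_clusters(stream).items():
        print(name, summary)
```

Note that the sketch only knows about hosts that have appeared in the stream at least once; a real setup would take cluster membership from an inventory rather than infer it from the alerts themselves.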
Nagios is highly customizable. With the help of a few plugins and some advanced configuration, the situation can get a lot better. To name a few options: dependencies, flapping alert detection, Nagios email reports, check_multi, check_mk, proprietary notification commands, and many more. But getting the hang of all of these possibilities and constantly maintaining them is a full-time job. And the question remains: should you manage your alerts in your email to begin with? I'd love to hear what you think. How do you manage your alerts? At BigPanda we're dedicated to solving these pains. Want to see how? Sign up here.