Here at BigPanda, we talk to many Ops teams. It’s an important part of our product development process, and helps us make sure that we’re focusing on the right pains for our customers. “Alert Spam” is a major recurring pain brought up by Ops teams: the constant flood of noisy alerts from your monitoring stack. We hear many explanations as to why teams receive so many alerts from so many systems, but the end result is always the same: overload. When you’re flooded with tens (or hundreds) of notifications a day, it becomes very hard to identify the pressing alerts that demand action. Among those pressing alerts, it’s often even harder to tell what should be tackled right now, and what can wait. This phenomenon was suitably named Alerts Fatigue at #Monitorama a few weeks back.
#1: Alert Per Host
What you see: 5 critical alerts from your server monitoring system, all at once.
What happened: Your caching layer consists of 20 servers. A new, faulty configuration was pushed to some of them, resulting in a torrent of low-memory alerts, one for each host.
In an ideal world: You’d receive one alert, indicating that 25% of your cluster has problems. And while we’re at it, if only one or two machines are down, the alert can wait for the morning. Ideally, thresholds would only be defined at the cluster or role level (see cattle vs. pets).
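To make the idea concrete, here is a minimal sketch of collapsing per-host alerts into one summary per cluster. Everything in it is an assumption for illustration: the alert dictionaries, the `summarize_by_cluster` helper, and the 20% `urgent_fraction` cutoff are hypothetical and not taken from any particular monitoring tool.

```python
from collections import defaultdict


def summarize_by_cluster(alerts, cluster_sizes, urgent_fraction=0.2):
    """Collapse per-host alerts into one summary per (cluster, check).

    alerts: list of dicts with "cluster", "check" and "host" keys (hypothetical shape).
    cluster_sizes: dict mapping cluster name to its total number of hosts.
    urgent_fraction: assumed cutoff above which the summary should page someone.
    """
    affected_hosts = defaultdict(set)
    for alert in alerts:
        affected_hosts[(alert["cluster"], alert["check"])].add(alert["host"])

    summaries = []
    for (cluster, check), hosts in sorted(affected_hosts.items()):
        fraction = len(hosts) / cluster_sizes[cluster]
        summaries.append({
            "cluster": cluster,
            "check": check,
            "affected_hosts": len(hosts),
            "fraction": fraction,
            "urgent": fraction >= urgent_fraction,
        })
    return summaries


# Five low-memory alerts from a 20-host caching layer become a single summary.
alerts = [
    {"cluster": "cache", "check": "low_memory", "host": f"cache-{i:02d}"}
    for i in range(5)
]
summary = summarize_by_cluster(alerts, {"cache": 20})
```

With this cutoff, five affected hosts out of 20 are flagged as urgent, while one or two affected hosts would produce a summary that can wait until morning.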
#2: Important != Urgent
What you see: Low disk space warnings for hosts X, Y and Z.
What happened: Nothing unexpected. After serving you well for three months, hosts X, Y and Z are slowly filling up with data. Maybe you should upgrade the disks. Maybe you should clean up some old data. But does it have to be now, in the middle of the night?
In an ideal world: Unless there’s a sudden growth in disk utilization, this is not an urgent matter. Instead of triggering a real-time alert, just send me a report every Monday. Include a list of hosts in my DC that have low disk space. Bonus points for adding a prediction of when free space will run out at its current pace.
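The "bonus points" prediction can be as simple as a straight-line extrapolation. The sketch below assumes you already collect periodic disk usage samples per host. The `days_until_full` and `weekly_report` helpers and the sample numbers are hypothetical, and a real forecast would probably want something more robust than a two-point slope.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until the disk is full, at the current pace.

    samples: list of (day_number, used_gb) tuples ordered by day (hypothetical shape).
    Returns None when usage is flat or shrinking, or when the samples span no time.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day <= first_day:
        return None
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return None
    return (capacity_gb - last_used) / growth_per_day


def weekly_report(hosts):
    """hosts: dict of host name -> (samples, capacity_gb). Soonest-to-fill first."""
    rows = [
        (name, days_until_full(samples, capacity_gb))
        for name, (samples, capacity_gb) in hosts.items()
    ]
    # Hosts with no predicted fill date (None) are listed last.
    return sorted(rows, key=lambda row: (row[1] is None, row[1] or 0.0))


hosts = {
    "X": ([(0, 400), (7, 414)], 500),  # growing by 2 GB a day
    "Y": ([(0, 100), (7, 100)], 500),  # not growing at all
}
report = weekly_report(hosts)
```

A sudden jump in `growth_per_day` between two consecutive reports is the kind of signal that could still justify a real-time alert.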
#3: Non-Adaptive Thresholds
What you see: The same high-load alerts, every Monday, right after lunch.
What happened: You’ve worked hard to set up and refine your Nagios thresholds. Now they don’t alert you needlessly every day. But then comes that one weekday that’s always busy, which predictably triggers the alert. What do you do? You acknowledge and ignore it.
In an ideal world: There’s a rhythm to your traffic, and your monitoring system should be aware of it. If your load always goes up at 1pm, so should your thresholds. An alert should be generated only when there is unexpected load, otherwise it’s not actionable.
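One simple way to approximate a threshold that follows your traffic is to keep a separate baseline for each weekday-and-hour slot. This is only a sketch under assumptions: the `build_baseline` and `is_unexpected` helpers, the three-standard-deviation tolerance, and the static fallback threshold are all hypothetical choices, not how Nagios or any other tool works.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """history: iterable of (weekday, hour, load) samples from previous weeks.

    Returns {(weekday, hour): (mean_load, stdev_load)}.
    """
    buckets = defaultdict(list)
    for weekday, hour, load in history:
        buckets[(weekday, hour)].append(load)
    return {slot: (mean(loads), pstdev(loads)) for slot, loads in buckets.items()}


def is_unexpected(baseline, weekday, hour, load,
                  tolerance=3.0, min_margin=0.5, default_threshold=5.0):
    """True when the load is well above what is normal for this time slot."""
    if (weekday, hour) not in baseline:
        # No history for this slot yet: fall back to an assumed static threshold.
        return load > default_threshold
    avg, spread = baseline[(weekday, hour)]
    return load > avg + max(tolerance * spread, min_margin)


# Mondays (weekday 0) at 1pm are always busy; Tuesdays (weekday 1) are quiet.
history = [
    (0, 13, 8.0), (0, 13, 9.0), (0, 13, 10.0),
    (1, 13, 2.0), (1, 13, 2.5), (1, 13, 3.0),
]
baseline = build_baseline(history)
monday_alert = is_unexpected(baseline, 0, 13, 9.5)   # normal Monday load
tuesday_alert = is_unexpected(baseline, 1, 13, 9.5)  # same load on a quiet day
```

The same load of 9.5 stays silent on a busy Monday and raises an alert on a quiet Tuesday, which is the behavior the static threshold could not give you.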
#4: Same Issue, Different System
What you see: Incoming critical alerts from Nagios, Pingdom, New Relic, KeyNote & Splunk…around the same time. Oh, and a growing number of customer complaints on ZenDesk.
What happened: Data corruption in a couple of Mongo nodes, resulting in heavy disk IO and some transaction errors. This is the kind of problem that can be seen at the server level, application level, and user level, so expect to hear about it from all your monitoring tools.
In an ideal world: You’d get one alert triggered by the system that captured the issue first. Following that, any other monitoring system that hits a related threshold should push its message to the same “incident thread”.
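As a rough illustration of an "incident thread", the sketch below groups alerts from different tools when they arrive close together in time. Time proximity is a deliberately naive stand-in for "related", and the `group_into_incidents` helper, the alert dictionaries, and the five-minute window are all assumptions made for this example.

```python
def group_into_incidents(alerts, window_seconds=300):
    """Group alerts into incidents by arrival time.

    alerts: list of dicts with "source", "time" (seconds) and "message" keys
    (hypothetical shape). An alert joins the current incident if it arrives
    within window_seconds of that incident's most recent alert.
    """
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        current = incidents[-1] if incidents else None
        if current and alert["time"] - current["alerts"][-1]["time"] <= window_seconds:
            current["alerts"].append(alert)
        else:
            # The first system to notice the problem opens the incident.
            incidents.append({"first_source": alert["source"], "alerts": [alert]})
    return incidents


alerts = [
    {"source": "Pingdom", "time": 60, "message": "checkout page slow"},
    {"source": "Nagios", "time": 0, "message": "high disk IO on mongo-2"},
    {"source": "New Relic", "time": 120, "message": "transaction error rate up"},
    {"source": "Splunk", "time": 4000, "message": "unrelated log spike"},
]
incidents = group_into_incidents(alerts)
```

In practice you would want a stronger notion of relatedness than timing alone, such as shared hosts or services, but even this crude grouping turns three pages into one.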
#5: Transient Alerts
What you see: Everybody has a few of these. The same issue pops up for a few minutes, every few days. It goes away just as fast as it appeared. And let’s face it, you’re busy enough as it is, so you’re probably not going to investigate it any time soon.
What happened: Maybe a cron job over-utilizes the network. Maybe a random race condition in one of your applications deadlocks the database. Maybe a rarely used feature of your product causes a backend process to crash.
In an ideal world: You would be able to mark issues for follow-up. Then, you wouldn’t hear from them again until the end of the month, when you’d get a nice report showing at what times the issue normally occurs (as well as other alerts that normally occur around the same time and may be related).
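The end-of-month report boils down to two counts: when the issue fires, and what else fires around the same time. Here is a minimal sketch, assuming you can export alert timestamps from your tools. The `occurrences_by_hour` and `co_occurring` helpers, the alert names, and the ten-minute window are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def occurrences_by_hour(timestamps):
    """Count how often the transient issue fired in each hour of the day."""
    return Counter(ts.hour for ts in timestamps)


def co_occurring(issue_times, other_alerts, window=timedelta(minutes=10)):
    """Count other alerts that fired within `window` of any issue occurrence.

    other_alerts: list of (alert_name, timestamp) tuples (hypothetical shape).
    """
    counts = Counter()
    for name, ts in other_alerts:
        if any(abs(ts - issue_ts) <= window for issue_ts in issue_times):
            counts[name] += 1
    return counts


issue_times = [
    datetime(2024, 1, 3, 2, 5),
    datetime(2024, 1, 6, 2, 7),
    datetime(2024, 1, 9, 14, 0),
]
other_alerts = [
    ("backup_cron_slow", datetime(2024, 1, 3, 2, 0)),
    ("backup_cron_slow", datetime(2024, 1, 6, 2, 1)),
    ("deploy_finished", datetime(2024, 1, 20, 9, 0)),
]
by_hour = occurrences_by_hour(issue_times)
related = co_occurring(issue_times, other_alerts)
```

In this made-up data, the issue clusters around 2am and lines up with a slow backup job, which is exactly the kind of hint that makes a follow-up investigation short.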
What kinds of alert spam are you experiencing? Care to share your creative workarounds? Leave your feedback in the comments section below.