We all need to move fast to stay competitive. But the faster things move, the faster they break.
While many companies have made great strides toward automating application release and infrastructure management, automation for service assurance has lagged badly behind. That's left Dev and Ops with a problem: how to effectively handle alert volumes that have grown by orders of magnitude.
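One common first step toward taming that alert volume is deduplication: collapsing repeated alerts for the same host and check into a single entry with a count. Here is a minimal sketch of the idea; the `Alert` fields and the `(host, check)` grouping key are illustrative assumptions, not any particular tool's schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    check: str
    severity: str
    message: str


def dedupe(alerts):
    """Collapse duplicate alerts into one entry per (host, check),
    keeping a count so responders see volume without the noise."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[(alert.host, alert.check)].append(alert)
    return {key: len(batch) for key, batch in grouped.items()}


raw = [
    Alert("web-01", "cpu", "warn", "CPU > 90%"),
    Alert("web-01", "cpu", "warn", "CPU > 95%"),
    Alert("db-01", "disk", "crit", "Disk > 85%"),
]
# Three raw alerts collapse to two actionable lines, one per (host, check).
print(dedupe(raw))
```

Real alert-management platforms go much further (time windows, fuzzy matching, topology awareness), but even this crude grouping shows why deduplication is usually the first lever teams pull.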
Sam’s a father of two boys living in the bucolic LA suburb of West Covina. He’s a family-first guy who paints model military cargo planes for fun, makes award-winning paella, hates his commute, and loathes his phone between the hours of midnight and 4:00 AM.
Sam was a kid when he joined News Corp as a help desk analyst in 2000. More than 15 years later, he’s now a Sr. Director of IT managing a growing team of 30 NOC engineers, sysadmins, and DBAs. Over the years, he has received more promotions than Trump on his own Twitter feed by delivering results and never wavering from two core beliefs that influence everything he does:
In the last two decades, with the emergence of cloud infrastructure and SaaS delivery models, the monitoring ecosystem has changed dramatically and now includes over 100 monitoring solutions. The upside of that change is rapid implementation of monitoring infrastructure; the unintended consequence is that the tools themselves decide what IT measures.
In my last post, I discussed how enterprise application sprawl, if left unchecked, puts organizations at risk. In this post, I’m going to discuss what to do about the problem. Today, any single department within even a mid-market enterprise will have more applications deployed than was standard, organization-wide, just a dozen or so years ago. These apps include everything from cloud-based CRM to social media tools to AWS workloads to big data tools to collaboration suites, and on and on.
Whether we run traditional operations with a 24x7 NOC and well-documented processes, or we embrace DevOps-style cross-functional teams and highly iterative methodologies, we all face the same problem: a growing disconnect between our monitoring systems, the alerts they fire off, and the processes we use to handle operational issues. We log incidents in a ticket, but are the people working that ticket aware of the real-time status of the underlying incident?