Any post about monitoring tool sprawl and too many alerts.
We all need to move fast in order to stay competitive. But the faster things move, the faster things break.
While many companies have made great strides towards automating application release and infrastructure management, automation for service assurance has been sorely lacking. That’s left Dev and Ops with a problem: how to effectively service alerts that have grown by orders of magnitude.
Sam’s a father of two boys living in the bucolic LA suburb of West Covina. He’s a family-first guy who paints model military cargo planes for fun, makes award-winning paella, hates his commute, and loathes his phone between the hours of midnight and 4:00 AM.
Sam was a kid when he joined News Corp as a help desk analyst in 2000. More than 15 years later, he’s now Sr. Director of IT, managing a growing team of 30 NOC engineers, sys admins, and DBAs. Over the years, he has received more promotions than Trump on his own Twitter feed by delivering results and never wavering from two core beliefs that influence everything he does:
In the last two decades, with the emergence of cloud infrastructure and SaaS delivery models, the monitoring ecosystem has changed dramatically to include over 100 monitoring solutions. The upside of that change is that monitoring infrastructure can be implemented rapidly; the unintended consequence is that the tools themselves decide what IT measures.
In my last post, I discussed how enterprise application sprawl, if left unchecked, puts organizations at risk. In this post, I’m going to discuss what to do about the problem. Today, any single department within even a mid-market enterprise will have more applications deployed than was standard – organization wide – just a dozen or so years ago. These apps include everything from cloud-based CRM to social media tools to AWS workloads to various big data tools to collaboration suites, and on and on and on.
Whether we practice more traditional operations with a 24x7 NOC and well-documented processes, or we’re embracing a DevOps style with cross-functional teams and highly iterative methodologies, one problem we all face is the growing disconnect between our monitoring systems, the alerts they fire off, and the processes we’re using to handle operational issues. We log incidents in a ticket, but are the folks working on that ticket aware of the real-time status of the underlying incident?
Enterprise application and computing environments have changed radically over the past fifteen years. Anyone who has spent even a day in an IT role can tell you that. What gets less attention, however, is how those changes undermine the ability of operations teams to do their jobs. The problem is that while computing and application environments have changed dramatically, workflows and org charts have not.
Earlier this month at BigPanda we released our new Sharing feature, which allows NOC teams to quickly share active and critical incidents with the right teams and subject-matter experts. BigPanda already helps NOC teams today by giving them instant visibility into incoming related alerts so that they don’t have to sift through dozens of emails and web pages with every outage or disruption. They can also attach playbooks and timeseries graphs directly to BigPanda, which means no more navigating around, combing through bookmarks, trying to find the right wiki page for that memory issue, or the right Graphite link for that misbehaving database host.
We're excited to announce the release of a major new feature in BigPanda called Sharing! As you know, BigPanda intelligently clusters your noisy alerts into high-level incidents. With our new Sharing feature, it's now easy to notify and collaborate with anyone on your team about critical incidents.
For those of you who are not familiar with Jenkins, it's a dead-simple open-source Continuous Integration solution that takes almost no time to set up. Jenkins has a vibrant ecosystem and community, and until recently, Jenkins only had 999 plugins available...
Data center growth over the last 15 years has created significant growing pains for data center management. Tasks that IT teams could once do manually have hit the limits of scalability, cost, and efficiency. Enabling IT to meet these challenges comes down to one theme: automation.
Modeling your production environment correctly is very important for development. Developers need to be able to run and test their code locally for the development process to be efficient, and many times this requires setting up infrastructure that exists in production on their local machines. The basic solution is a simple Vagrant box containing all your infrastructure and application code, like the one we mentioned in our Devbox post.
Monitoring applications in production has never been easier. With only a few lines of code, you'll have New Relic installed and monitoring your application from nearly every angle. When something goes wrong, New Relic will start sending alerts. But then what? (Hint: New Relic and BigPanda together are the answer.)
Last week was an exciting one. BigPanda announced $7 million in funding from Sequoia Capital and Mayfield. We are super excited that these two firms share our vision for changing the way that IT and DevOps teams manage and respond to the thousands of IT issues they face every day. We also launched our offering into general availability. Check out some of the highlights from last week’s coverage of BigPanda from TechCrunch, GigaOm, Computerworld, 451 Research and more.
It’s well known in IT operations that things don't break on their own. Close to 80% of production outages occur because of changes made by developers or someone in IT. However, this fact often eludes us when it comes to actually resolving production issues.