Any post about problems like downtime, outages, missed SLAs, etc.
As applications and infrastructures become more complex, it's time to say goodbye to monolithic solutions and build a best in breed stack, with BigPanda.
Ask yourself these questions to find the right fit in an alert correlation platform.
To maintain operational visibility in modern IT environments, companies are abandoning monolithic monitoring solutions from legacy vendors in favor of a modern set of “best of breed” monitoring tools. Today’s average IT monitoring stack consists of about 6-8 tools, including at least one from each of the following categories: systems monitoring, end user monitoring, application performance monitoring (APM), error detection, log analytics, chat, and ticketing. When service disruptions occur, operations engineers face a flood of alerts across different layers of the IT stack, with no fast way to figure out what’s really going on. Customers are left stranded, while IT professionals struggle to detect, triage and remediate urgent issues. Downtime abounds which negatively impacts revenue, performance, and brand loyalty.
We all need to move fast in order to stay competitive. But the faster things move, the faster things break.
While many companies have made great strides towards automating application release and infrastructure management, automation for service assurance has been sorely lacking. That’s left Dev and Ops with a problem: how to effectively service alerts that have grown by orders of magnitude.
Sam’s a father of two boys living in the bucolic LA suburb of West Covina. He’s a family first guy who paints model military cargo planes for fun, makes award-winning paella, hates his commute, and loathes his phone between the hours of midnight and 4:00 AM.
Sam was a kid when he joined News Corp as a help desk analyst in 2000. More than 15 years later and he’s now Sr. Director of IT managing a growing team of 30 NOC engineers, sys admins, and DBAs. Over the years, he has received more promotions than Trump on his own Twitter feed by delivering results and never wavering from two core beliefs that influence everything he does:
For many IT and Ops teams, Nagios is both a blessing and a curse. On the one hand, Nagios gives you near real-time visibility into the inner workings of your IT infrastructure. But on the other hand, Nagios can generate so many alerts that it’s impossible for any single person (or even any team) to keep up.
If you’re struggling with a flood of Nagios alerts, this two-part blog series is for you. We’ll take a close look at the complicated relationship that IT and Ops professionals have with the monitoring tool, explain why Nagios is so noisy, and discuss the simple way that you take charge of your alerts and maximize the way Nagios works for you.
This is part two of a two-part post about using event correlation to thwart DDoS attacks. Channeling Mark Twain: it would have been shorter if I had more time. In the last post I described why DDoS attacks for SaaS providers are no different than performance and availability issues experienced in other domains like healthcare, finance, or retail. In this post I’ll share a customer story about a security breach that never happened… thanks to a savvy DevOps team and data science.
Why DDoS attacks aren’t just a security problem… and monitoring traffic isn’t the solution – Part One
Every company’s a target, every customer’s at risk. But the now-cliched threat of data breaches from Distributed Denial of Service (DDoS) attacks obscures a bigger threat: outages that impact not just data integrity but also profitability, brand equity, and customer retention.
The volume of attacks is growing and so is the impact of down time. According to Akamai’s most recent State of the Internet report, DDoS attacks are a bigger threat than ever before. “The number of DDoS attacks continued to increase substantially in Q2 2015, more than doubling the number observed in Q2 2014.”
Tsunami detection. Crop dusting. Biohazard monitoring. What may sound like innuendos in the next EL James novel are also fields being revolutionized by quant jocks and smart algorithms. And yet, despite all the innovation, we technorati continue to bastardize the terms “data science”, “machine learning,” and “big data”. They’ve become lazy speak for “we’re not sure what we’re doing so we’ll hand wave cliches until we have real technology and a business model."
Today we announced an integration with the excellent cloud monitoring system Librato which was recently acquired by SolarWinds. We’ve enjoyed working with the Librato team to bring the product to market and now are eagerly awaiting feedback from the loud and proud community of BigPanda+Librato users.
I met Vlad in the bar in Vegas after a long day of telco NOC drudgery. He was enjoying his whisky and clearly didn’t want to be interrupted by me asking about his datacenter. I could tell he’d rather I had asked about anything else… Cat Stevens, Greek myths, Faberge eggs. Anything. I interrupted him anyway and asked what’s required to go from the three nines he referenced in his keynote to the five nines his customers demand. He winced in pain. I thought he swallowed an ice cube or his Johnnie Walker was laced with cyanide. Turns out he was deep in thought. He proceeded to share wisdom that inspired me… to drink whisky and grow facial hair.
Earlier this month at BigPanda we released our new Sharing feature, which allows NOC teams to quickly share active and critical incidents with the right teams and subject-matter experts. BigPanda already helps NOC teams today by giving them instant visibility into incoming related alerts so that they don’t have to sift through dozens of emails and web pages with every outage or disruption. They can also attach playbooks and timeseries graphs directly to BigPanda, which means no more navigating around, combing through bookmarks, trying to find the right wiki page for that memory issue, or the right Graphite link for that misbehaving database host.
BigPanda is an incident management platform for modern IT, Ops, and DevOps teams. With BigPanda, you will prioritize and route your incidents better and faster, while vastly improving your team's collaboration and processes. This is part 2 in a series on Getting Started with BigPanda. This guide will help you get up and running quickly and maximize the value you get out of the platform.
BigPanda is an incident management platform for modern IT, NOC and DevOps teams. With BigPanda, you will prioritize and route your incidents better and faster, while vastly improving your team’s collaboration and processes. This is part 3 in a series on Getting Started with BigPanda. This product introduction will help you to get up and running quickly so you can get back to hunting fail-whales and 404 errors.
BigPanda is an incident management platform for modern Ops environments. With BigPanda, you will prioritize and assign your incidents better and faster, while vastly improving your team’s collaboration and processes. This is part 4 in a series on Getting Started with BigPanda. This guide will help you get up and running quickly and maximize the value you get out of the platform.
The last ten years have brought enormous changes to production environments, driven by a best-of-breed approach to production infrastructure enabled by open source and cloud. This has been a boon for developers in terms of flexibility and productivity, but it’s also placed a new set of challenges and expectations on Ops.
Monitoring applications in production has never been easier. With only a few code lines, you'll have New Relic installed and monitoring your application from nearly every angle. When something goes wrong, New Relic will start sending alerts. But then what? (hint – New Relic and BigPanda together is the answer).
Anomaly detection for monitoring has been a trending topic in recent years. And while the math behind it is fascinating, too much of the discussion has revolved around histograms, moving averages and standard deviations. More discussion needs to happen around its practical applications, and for that reason, this practical guide to anomaly detection will attempt to provide an actionable overview of current off-the-shelf anomaly detection tools.
Many alerts place an unnecessary burden on Ops teams instead of helping them to solve issues. The main problem is that most alerts are not actionable enough:
- They point to issues that don’t require a response
- They lack critical information, forcing you to spend time searching for more insights in order to gauge their urgency