
Not all alert correlation platforms are created equal

May 23rd, 2016 | Blog

Ask yourself these questions to find the right fit in an alert correlation platform.

To maintain operational visibility in modern IT environments, companies are abandoning monolithic monitoring solutions from legacy vendors in favor of a modern set of “best of breed” monitoring tools. Today’s average IT monitoring stack consists of about 6-8 tools, including at least one from each of the following categories: systems monitoring, end user monitoring, application performance monitoring (APM), error detection, log analytics, chat, and ticketing. When service disruptions occur, operations engineers face a flood of alerts across different layers of the IT stack, with no fast way to figure out what’s really going on. Customers are left stranded, while IT professionals struggle to detect, triage, and remediate urgent issues. Downtime abounds, negatively impacting revenue, performance, and brand loyalty.

How alert correlation helps Dev and Ops work better together

April 28th, 2016 | Blog

This post was recently published as a guest blog by our friends at Jira Service Desk. You can find the original post here.

We all need to move fast in order to stay competitive. But the faster things move, the faster things break.

While many companies have made great strides towards automating application release and infrastructure management, automation for service assurance has been sorely lacking. That’s left Dev and Ops with a problem: how to effectively service alerts whose volume has grown by orders of magnitude.

Sam Kendall’s noisy alert problem

February 23rd, 2016 | Blog

Sam’s a father of two boys living in the bucolic LA suburb of West Covina. He’s a family-first guy who paints model military cargo planes for fun, makes award-winning paella, hates his commute, and loathes his phone between the hours of midnight and 4:00 AM.

Sam was a kid when he joined News Corp as a help desk analyst in 2000. More than 15 years later, he’s now Sr. Director of IT, managing a growing team of 30 NOC engineers, sys admins, and DBAs. Over the years, he has received more promotions than Trump on his own Twitter feed by delivering results and never wavering from two core beliefs that influence everything he does:

Part 1 of 2: The reason why Nagios is so noisy – and what you can do about it

December 1st, 2015 | Blog

If you’re struggling with a flood of Nagios alerts, this two-part blog series is for you. We’ll take a close look at the complicated relationship that IT and Ops professionals have with the monitoring tool, explain why Nagios is so noisy, and discuss a simple way to take charge of your alerts and maximize the way Nagios works for you.

15 hours of downtime… avoided: part two of a two-part series

October 31st, 2015 | Blog

This is part two of a two-part post about using event correlation to thwart DDoS attacks. Channeling Mark Twain: it would have been shorter if I had more time. In the last post, I described why DDoS attacks on SaaS providers are no different than performance and availability issues experienced in other domains like healthcare, finance, or retail. In this post I’ll share a customer story about a security breach that never happened… thanks to a savvy DevOps team and data science.

Why DDoS attacks aren’t just a security problem… and monitoring traffic isn’t the solution – Part One

October 16th, 2015 | Blog

Every company’s a target, every customer’s at risk. But the now-clichéd threat of data breaches from Distributed Denial of Service (DDoS) attacks obscures a bigger threat: outages that impact not just data integrity but also profitability, brand equity, and customer retention.

The volume of attacks is growing, and so is the impact of downtime. According to Akamai’s most recent State of the Internet report, DDoS attacks are a bigger threat than ever before. “The number of DDoS attacks continued to increase substantially in Q2 2015, more than doubling the number observed in Q2 2014.”

Hey Silicon Valley, you’re wrong about “Data Science” and “Machine Learning”

August 31st, 2015 | Blog

Tsunami detection. Crop dusting. Biohazard monitoring. What may sound like innuendos in the next EL James novel are also fields being revolutionized by quant jocks and smart algorithms. And yet, despite all the innovation, we technorati continue to bastardize the terms “data science,” “machine learning,” and “big data.” They’ve become lazy speak for “we’re not sure what we’re doing, so we’ll hand-wave clichés until we have real technology and a business model.”

#Monitoringlove in Portland

June 12th, 2015 | Blog

Last year’s Monitorama was an amazing experience, and we couldn’t wait to come back for more. BigPanda will be back at Monitorama to hear talks from leading open source developers, web operations experts, and a variety of thought leaders in the monitoring space.

How 83% noise suppression saved Vlad a million dollars so far this year

April 21st, 2015 | Blog

I met Vlad in the bar in Vegas after a long day of telco NOC drudgery. He was enjoying his whisky and clearly didn’t want to be interrupted by me asking about his datacenter. I could tell he’d rather I had asked about anything else… Cat Stevens, Greek myths, Faberge eggs. Anything. I interrupted him anyway and asked what’s required to go from the three nines he referenced in his keynote to the five nines his customers demand. He winced in pain. I thought he swallowed an ice cube or his Johnnie Walker was laced with cyanide. Turns out he was deep in thought. He proceeded to share wisdom that inspired me… to drink whisky and grow facial hair.

How a Culture of Sharing Transforms IT Incident Management

January 22nd, 2015 | Blog

Earlier this month at BigPanda we released our new Sharing feature, which allows NOC teams to quickly share active and critical incidents with the right teams and subject-matter experts. BigPanda already helps NOC teams by giving them instant visibility into incoming related alerts so that they don’t have to sift through dozens of emails and web pages with every outage or disruption. They can also attach playbooks and timeseries graphs directly to BigPanda, which means no more navigating around, combing through bookmarks, trying to find the right wiki page for that memory issue, or the right Graphite link for that misbehaving database host.

Getting Started with BigPanda – Incident Triage

October 17th, 2014 | Blog

BigPanda is an incident management platform for modern IT, Ops, and DevOps teams. With BigPanda, you will prioritize and route your incidents better and faster, while vastly improving your team's collaboration and processes. This is part 2 in a series on Getting Started with BigPanda. This guide will help you get up and running quickly and maximize the value you get out of the platform.

Getting Started with BigPanda – Incident Analysis

October 15th, 2014 | Blog

BigPanda is an incident management platform for modern IT, NOC and DevOps teams. With BigPanda, you will prioritize and route your incidents better and faster, while vastly improving your team’s collaboration and processes. This is part 3 in a series on Getting Started with BigPanda. This product introduction will help you to get up and running quickly so you can get back to hunting fail-whales and 404 errors.

Getting Started with BigPanda – Assign Incidents

October 13th, 2014 | Blog

BigPanda is an incident management platform for modern Ops environments. With BigPanda, you will prioritize and assign your incidents better and faster, while vastly improving your team’s collaboration and processes. This is part 4 in a series on Getting Started with BigPanda. This guide will help you get up and running quickly and maximize the value you get out of the platform.

Golden Age of Developers = Nightmare for Ops

September 18th, 2014 | Blog

The last ten years have brought enormous changes to production environments, driven by a best-of-breed approach to production infrastructure enabled by open source and cloud. This has been a boon for developers in terms of flexibility and productivity, but it’s also placed a new set of challenges and expectations on Ops.

New Relic and BigPanda = #Monitoringlove

July 8th, 2014 | Blog

Monitoring applications in production has never been easier. With only a few lines of code, you'll have New Relic installed and monitoring your application from nearly every angle. When something goes wrong, New Relic will start sending alerts. But then what? (Hint: New Relic and BigPanda together are the answer.)
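To make “a few lines of code” concrete, here is a minimal, hypothetical sketch of instrumenting a Python WSGI app with the New Relic agent. It assumes you’ve run `pip install newrelic` and generated a `newrelic.ini` with your license key (for example via `newrelic-admin generate-config <LICENSE_KEY> newrelic.ini`); the file path and toy app are illustrative only.

```python
import newrelic.agent

# Load the agent configuration before the application code runs,
# so New Relic can hook into the frameworks and libraries you use.
newrelic.agent.initialize("newrelic.ini")  # path is an assumption for this sketch


@newrelic.agent.wsgi_application()
def application(environ, start_response):
    """A toy WSGI app; with the decorator in place, New Relic times every request."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]
```

Many teams skip the decorator entirely and wrap their process with `newrelic-admin run-program` instead; either way, the instrumentation really is just a handful of lines.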

A Practical Guide to Anomaly Detection for DevOps

June 26th, 2014 | Blog

Anomaly detection for monitoring has been a trending topic in recent years. And while the math behind it is fascinating, too much of the discussion has revolved around histograms, moving averages, and standard deviations. More of the discussion needs to focus on practical applications, and for that reason, this guide will attempt to provide an actionable overview of current off-the-shelf anomaly detection tools.
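To ground the math the discussion keeps circling, here is a minimal, illustrative Python sketch of the moving-average-plus-standard-deviation idea mentioned above. The window size, threshold, and sample data are arbitrary assumptions for the example, not recommendations; real off-the-shelf tools handle seasonality, trends, and sparse data far more carefully.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(series, window=30, threshold=3.0):
    """Flag points that deviate from the trailing moving average
    by more than `threshold` standard deviations."""
    history = deque(maxlen=window)  # sliding window of recent values
    anomalies = []
    for i, value in enumerate(series):
        if len(history) >= 2:  # stdev needs at least two samples
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append((i, value))
        history.append(value)
    return anomalies


# Example: a flat-ish CPU metric with one obvious spike.
cpu = [42, 41, 43, 40, 44, 42, 43, 95, 41, 42]
print(detect_anomalies(cpu, window=5))  # -> [(7, 95)]
```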

4 Ways to Combat Non-Actionable Alerts

April 23rd, 2014 | Blog

Many alerts place an unnecessary burden on Ops teams instead of helping them solve issues. The main problem is that most alerts are not actionable enough (one way to push back is sketched after this list):

  • They point to issues that don’t require a response
  • They lack critical information, forcing you to spend time searching for more insights in order to gauge their urgency
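As a purely hypothetical illustration of how a team might push back on both problems, the sketch below drops alerts that don’t require a response and attaches a runbook link to the rest so their urgency can be gauged at a glance. Every field name, severity level, and URL here is invented for the example; it is not a description of any particular product or workflow.

```python
# Hypothetical example: suppress alerts that need no response and enrich the rest
# with context, so whatever reaches an engineer is actionable.
# All field names, severities, and runbook URLs are invented for illustration.

ACTIONABLE_SEVERITIES = {"critical", "warning"}

RUNBOOKS = {
    "disk_full": "https://wiki.example.com/runbooks/disk_full",
    "high_latency": "https://wiki.example.com/runbooks/high_latency",
}


def triage(alerts):
    actionable = []
    for alert in alerts:
        # 1. Drop alerts that don't require a response.
        if alert.get("severity") not in ACTIONABLE_SEVERITIES:
            continue
        # 2. Attach the context needed to gauge urgency without hunting for it.
        alert["runbook"] = RUNBOOKS.get(alert.get("check"), "no runbook yet")
        actionable.append(alert)
    return actionable


print(triage([
    {"check": "disk_full", "severity": "critical", "host": "db-01"},
    {"check": "heartbeat", "severity": "info", "host": "web-03"},
]))
```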