Few things damage productivity as much as waiting. The act of waiting forces us to context switch, disrupts our creative momentum, and eliminates our ability to experiment. Whether we are deploying a new service or troubleshooting a problem, waiting puts a heavy tax on team productivity in IT Operations.
In contrast, nothing is as delightful as things that don’t make us wait: scripts that complete their work in under a second, queries that return instantly, and web pages that don’t take forever to load.
Over the past decade, automation technology has done a lot to speed up previously manual, menial IT tasks with faster, more efficient processing. Advanced machine learning and AI technologies are being applied to the Big Data problem in IT – for example, sifting through loads of event data to correctly detect critical problems.
But now that the machines have become faster, the burden of waiting has shifted to us humans. Once AI has correctly flagged an incident, Level 1 operators are often left waiting on Level 2 domain experts or Level 3 engineers to remediate and clear the issue. Smooth, frictionless collaboration and information sharing are key to a shorter MTTR.
The hiccups in incident management now lie too often with the front-end user interface. Many IT Operations teams are forced to depend on complicated, difficult-to-use consoles for incident management. The simplest of actions can require multiple steps, which are too often slowed by system latency. This in turn lengthens incident resolution times, because effective troubleshooting, historical root cause analysis and cross-team collaboration all rely on speed. When dealing with outages and other critical customer-impacting problems, any factor that impedes team productivity and efficiency must be corrected.
The Necessity of a Console in Autonomous Digital Operations
In attempting to define the building blocks of Autonomous Digital Operations (ADO), we debated whether an incident management console is essential to the product category. Ultimately we concluded that a unified interface isn’t necessary for a software vendor to claim an ADO solution. For example, ADO could be deployed as a virtual software appliance feeding correlated incident data to remediation tools such as Red Hat Ansible or BMC BladeLogic.
That being said, BigPanda built our Operations Console on the following assertions. We submit that any console for Autonomous Digital Operations should, at minimum:
- Handle ticketing and workflow for incident management
- Support centralized collaboration across Level 1, Level 2, and Level 3 teams
- Keep a searchable historical record of alerts and incidents
You’ll find that BigPanda Operations Console is so much more than that. First, it is fully customizable, with an environment that allows you to easily personalize operational data in the console for specific teams, clouds, apps or services. Second, it is highly intuitive, with dynamic visualization capabilities like analytic dashboards, incident timelines and wallboards. Taken together, these allow your teams to view the overall health of your infrastructure, track the real-time status of critical services and applications, and drill down into the history of an incident to more easily investigate probable root cause. Remediation progress can also be measured and reported per service or team.
The Need for Speed in IT Incident Management
Speed is an invaluable commodity for the NOC operator and the DevOps engineer alike. At BigPanda we are so passionate about speed that we’ve made our Operations Console the fastest, easiest-to-use incident management dashboard in the world.
As a native SaaS platform, BigPanda ensures that customers always enjoy the most current version of our Unified Console. We made a few architectural choices early on to speed up our system performance, and we are constantly evolving and improving its front-end functionality.
Low-Latency Data Pipeline – Our data processing pipeline is composed of fast, message-driven microservices in a highly secure and redundant SOC 2 compliant infrastructure. We measure latency in each of these asynchronous microservices, alert on it, and leverage a vertically and horizontally scalable architecture. This ensures that no single service devolves into a performance bottleneck. Recent improvements ensure that our inbound data pipeline processes events faster, with less latency, in a multi-tenant system handling very large alert volumes (100k or more per minute). Our autonomous detection capabilities, which include steps like normalization, enrichment and correlation, take less than 100 milliseconds on average.
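To make the idea of measuring and alerting on per-stage latency concrete, here is a minimal Python sketch. It is an illustration only, not BigPanda's implementation: the `StageTimer` class, the three toy stage functions, the event fields and the `budget_ms` threshold are all hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass
class StageTimer:
    """Runs one pipeline stage at a time and records how long it took."""
    budget_ms: float  # hypothetical per-stage latency budget
    timings_ms: Dict[str, float] = field(default_factory=dict)
    alerts: List[str] = field(default_factory=list)

    def run(self, name: str, stage: Callable[[Event], Event], event: Event) -> Event:
        start = time.perf_counter()
        result = stage(event)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.timings_ms[name] = elapsed_ms
        # Record an alert whenever a stage exceeds its budget.
        if elapsed_ms > self.budget_ms:
            self.alerts.append(f"{name} took {elapsed_ms:.1f} ms (budget {self.budget_ms} ms)")
        return result


def normalize(event: Event) -> Event:
    # Lower-case the keys so events from different sources share one schema.
    return {key.lower(): value for key, value in event.items()}


def enrich(event: Event) -> Event:
    # Attach a made-up service tag derived from the host name.
    service = "checkout" if event.get("host", "").startswith("web") else "unknown"
    return {**event, "service": service}


def correlate(event: Event) -> Event:
    # Derive a grouping key; events sharing a key would join the same incident.
    return {**event, "incident_key": f'{event["service"]}:{event.get("check", "")}'}


def process(event: Event, timer: StageTimer) -> Event:
    for name, stage in (("normalize", normalize), ("enrich", enrich), ("correlate", correlate)):
        event = timer.run(name, stage, event)
    return event
```

Calling `process({"Host": "web-01", "Check": "cpu"}, StageTimer(budget_ms=100.0))` returns the enriched event and leaves per-stage timings in `timer.timings_ms`. In a real deployment each stage would be a separate service and the timings would go to a metrics system rather than an in-memory list.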
Real-time Front End – Legacy monitoring front ends rely on full page reloads for updating, or they use AJAX-based polling, which introduces latency and unnecessary load on the back end. By contrast, BigPanda’s front end maintains an open WebSocket connection to our back end. New events and status updates are pushed to the front end in real time. This means, for example, that a Nagios-generated alert is likely to appear in BigPanda before it shows up in the Nagios dashboard.
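The difference between polling and pushing can be sketched with a small in-process example. The Python code below uses `asyncio` queues to stand in for open connections, so no network or WebSocket library is involved; the `Broadcaster` class and the message strings are assumptions made for illustration and say nothing about how BigPanda's back end is actually built.

```python
import asyncio
from typing import List, Set


class Broadcaster:
    """Fans each published update out to every connected subscriber."""

    def __init__(self) -> None:
        self._subscribers: Set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        # Each queue plays the role of one open client connection.
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.discard(queue)

    def publish(self, update: str) -> None:
        # Push immediately; clients never have to ask whether anything changed.
        for queue in self._subscribers:
            queue.put_nowait(update)


async def client(broadcaster: Broadcaster, expected: int) -> List[str]:
    queue = broadcaster.subscribe()
    received: List[str] = []
    try:
        while len(received) < expected:
            # Waits without polling; wakes only when an update is pushed.
            received.append(await queue.get())
    finally:
        broadcaster.unsubscribe(queue)
    return received


async def demo() -> List[List[str]]:
    broadcaster = Broadcaster()
    clients = [asyncio.create_task(client(broadcaster, 2)) for _ in range(2)]
    await asyncio.sleep(0)  # let both clients subscribe before publishing
    broadcaster.publish("alert: cpu high on web-01")
    broadcaster.publish("status: incident acknowledged")
    return await asyncio.gather(*clients)
```

Running `asyncio.run(demo())` delivers both updates to both clients as soon as they are published. A polling client would instead issue a request on a timer, paying up to one full polling interval of delay per update and loading the back end even when nothing has changed.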
Reactive User Interface – Data-heavy web applications are often sluggish and clunky, and data volumes for large organizations can render them unusable. To avoid this, BigPanda implemented a set of UI performance optimizations such as virtual scrolling, flexbox liquid layouts, SVG visualizations and manipulation buffering. We’ve tested our UI with tens of thousands of concurrent incidents with no noticeable impact on overall performance and responsiveness.
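Virtual scrolling, the first optimization named above, keeps a long list responsive by rendering only the rows near the viewport. The core calculation is small enough to sketch in Python; the function name, parameters and the fixed row height are assumptions for illustration, and a real front end would do the equivalent in browser code.

```python
from typing import Tuple


def visible_range(scroll_top: int, viewport_height: int, row_height: int,
                  total_rows: int, overscan: int = 3) -> Tuple[int, int]:
    """Return the half-open [start, end) range of row indices to render.

    All rows are assumed to share one fixed height in pixels. `overscan`
    adds a few extra rows above and below the viewport so that fast
    scrolling does not briefly expose blank space.
    """
    if row_height <= 0:
        raise ValueError("row_height must be positive")
    first = scroll_top // row_height
    # Ceiling division so that a partially visible last row is included.
    last = -(-(scroll_top + viewport_height) // row_height)
    start = max(0, first - overscan)
    end = min(total_rows, last + overscan)
    return start, end
```

For a list of 50,000 incidents with 40-pixel rows in a 600-pixel viewport, `visible_range(10000, 600, 40, 50000)` returns `(247, 268)`, so only 21 rows need to exist in the page at that scroll position regardless of how long the list is.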
Team Collaboration Features – We introduce new user interface and navigation features regularly. For example, the Incident Feed Wallboard allows the entire IT Ops team to view the infrastructure’s health at a glance and track activity in one comprehensive view, even as new incidents stream in. Environment Groups provide a hierarchical way to organize common functions (such as business services, teams and infrastructure areas) that roll up under Environments, which group related incidents.
In short, the benefits to customers of BigPanda’s “need for speed” are higher productivity, centralized visibility and improved team collaboration.
Get to know BigPanda’s Operations Console. It provides a satisfying experience to any speed-addicted DevOps or NOC team.