BigPanda blog

Machine Learning in IT & Digital Operations: Why Now, And What to Keep in Mind

You’ve just recovered from a critical application outage and your team is being asked to report on root cause and recommended remediation steps later this afternoon. Can you quickly analyze all the data, identify all the leading events, and discern which one was responsible for the cascading failure?

Later that week, you are back to work integrating the retail warehouses’ new inventory management system that’s loaded with IoT sensors. Management wants to monitor the warehouse centrally to make sure that delivery is keeping up with order demands. Your team is working with a ton more data and a new set of monitoring tools to track operating conditions and respond to issues in real-time.

Several weeks later, you’re informed that the marketing team will start its holiday season campaigns a month earlier than in previous years and that the digital operations team had better be ready to maintain the availability and performance of the retail website. To make matters more difficult, the development team will be releasing new enhancements right up to when the campaign starts. Your team needs to keep up with this year’s expected demand.

Digital Operations must support more systems and recover faster

Many IT and digital operations groups are struggling with the issues of managing operations across a wider breadth of systems and with greater business expectations around uptime and performance. Every second counts when recovering from an issue and there is no tolerance for repeat issues.

The enterprise technology environment has only gotten more difficult to manage. Most enterprises have applications in the cloud, on premises, and hosted with SaaS providers. With every new type of technology, IT has different tools to monitor and manage systems, services, and applications. Sure, these tools enable some level of automated response, but they are rule-based and require coding many if-this-then-that rules to cover all the potential scenarios. Every outage leads to developing more rules, and the digital operations staff can’t keep up.

When machine learning enables digital operations to do more with less

Machine learning is driving significant gains in areas where the volume, velocity, and complexity of operational data are too high for people to process on their own. Applying machine learning to IT operational data can lead to faster response to incidents, more automated responses, and root causes that are easier to diagnose. Machine learning algorithms can identify which alert tripped first during an incident, which system parameters are early indicators of an emerging problem, and what types of user behaviors may be leading to application performance issues.
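To make the "which alert tripped first" idea concrete, here is a minimal Python sketch. It is not a machine learning model and not any vendor's implementation; it only shows the simplest version of the question, assuming alerts from different tools have already been normalized into one record shape. The `Alert` class and its fields are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alert:
    # Hypothetical normalized alert record; field names are assumptions.
    source: str          # monitoring tool that raised the alert
    check: str           # name of the failing check
    raised_at: datetime  # when the alert fired


def first_alert(alerts):
    """Return the alert with the earliest timestamp, or None if there are none."""
    return min(alerts, key=lambda a: a.raised_at, default=None)
```

The earliest alert is only a candidate for the leading event. Clock skew between tools and different polling intervals can reorder alerts, so a person should still review the suggestion.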

But IT can’t get these benefits by just plugging in a black box machine learning engine and giving it the reins.

That’s not realistic with today’s artificial intelligence. Vendors selling black box AI and ML that claims to identify and respond to any and all IT issues are just selling fool’s gold.

On the other hand, Open Box Machine Learning can yield significant benefits. Open box ML is designed to:

  • Support agile, iterative improvements so that the digital operations team can review patterns and recommendations from the machine learning algorithms before choosing to use them
  • Enable testing of models and responses against historical data to validate patterns and ensure automations respond appropriately
  • Promote collaboration and trust by sharing visual and verbal information on patterns with digital operations teams and their leaders
  • Incorporate business and tribal knowledge easily and quickly, by letting digital operations teams and domain experts add their hard-won, real-world knowledge to correlation patterns
  • See value in production quickly, within just weeks, by eliminating the need to train machine learning models for months or more before results start trickling in

Open Box ML doesn’t try to find and fix everything on its own. It functions as a form of augmented intelligence that enables the digital operations team to dramatically scale to manage and respond to their complex and demanding environment.

Open Box Machine Learning in Action

Let’s look at Open Box ML applied to one of the earlier examples. The retail site is running in AWS, with CloudWatch monitoring the cloud services and Datadog monitoring the application. IT is using Ansible to configure the environment, Jenkins for CI/CD, Jira for the development backlog, and ServiceNow to manage tickets.

When there is an issue, the operations team used to have to look in CloudWatch to see if it’s a systems issue and then in Datadog to look for application issues. Now, the data from these and other monitoring tools is correlated by the Open Box ML, and Ops can look at a single incident instead of reviewing disparate alerts from many systems. For example, let’s say a web service is experiencing a performance issue and hundreds of alerts are generated. With Open Box ML, Ops has just one incident to look at and can resolve the issue faster, using the probable root cause that Open Box ML suggests.
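As a rough illustration of how many alerts can collapse into one incident, the sketch below groups alerts by service and by closeness in time. This is a simple heuristic chosen for illustration, not the correlation logic of any particular product. The tuple layout `(service, timestamp, message)` and the five-minute window are assumptions.

```python
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(minutes=5)):
    """Group (service, timestamp, message) alerts into incidents.

    An alert joins the open incident for its service when it arrives within
    `window` of that incident's most recent alert; otherwise it starts a new one.
    """
    incidents = []
    open_incident = {}  # service -> incident currently collecting alerts
    for service, ts, message in sorted(alerts, key=lambda a: a[1]):
        current = open_incident.get(service)
        if current is not None and ts - current["last_seen"] <= window:
            current["alerts"].append((ts, message))
            current["last_seen"] = ts
        else:
            current = {
                "service": service,
                "first_seen": ts,
                "last_seen": ts,
                "alerts": [(ts, message)],
            }
            incidents.append(current)
            open_incident[service] = current
    return incidents
```

Because the window and the grouping key are visible and adjustable, the operations team can inspect why alerts were grouped and tune the pattern, which is the kind of review the open box approach is meant to allow.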

The system can then automatically create tickets in ServiceNow and Jira to alert Ops to the incident type and also provide a drill-down view of the underlying data. This helps Ops respond to and recover from issues faster, and over time, it improves key performance indicators such as the mean time to repair (MTTR).
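For teams that want to track this indicator themselves, MTTR is the average time between an incident opening and its resolution. The sketch below assumes each resolved incident is available as an `(opened_at, resolved_at)` pair of datetimes; that input shape is an assumption, and real ticket exports will need mapping into it.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair duration over (opened_at, resolved_at) datetime pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    if not durations:
        raise ValueError("need at least one resolved incident")
    # Summing timedeltas needs an explicit timedelta start value.
    return sum(durations, timedelta()) / len(durations)
```

Comparing this value for the periods before and after a change in tooling gives a simple, if coarse, way to check whether recovery is actually getting faster.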

Future Digital Operations Depends on Machine Learning

As an IT leader, you have the opportunity to bring these differentiating capabilities to your organization.

They don’t just improve the uptime of systems. They can also improve customer and end-user experience when issues are hard to identify, and they can enable the team to handle a wide breadth of systems and applications as organizations acquire new technology or integrate new platforms from merger and acquisition activity.

The sooner you get started, the faster you’ll be able to provide your organization competitive value. Kicking the car until the engine starts doesn’t work anymore.

Isaac Sacolick, President of StarCIO, is the author of Driving Digital: The Leader’s Guide to Business Transformation through Technology, which covers many practices such as agile, DevOps, and data science that are critical to successful digital transformation programs. Sacolick is a recognized top social CIO, digital transformation influencer, and contributing editor at InfoWorld, CIO.com, and Social, Agile and Transformation.