BigPanda blog

Everything you need to know about IT Operations Analytics

IT engineers and executives are responsible for system reliability and availability. The volume of data can make it hard to be proactive and fix issues quickly. With over a decade of experience in the field, I know the importance of IT operations analytics and how it can help identify incidents and enable agile responses.

What Is IT Operations Analytics?

IT operations analytics (ITOA) uses big data techniques to analyze IT system performance. These findings help companies deploy IT resources more efficiently and effectively. IT teams turn to ITOA to diagnose and fix problems more quickly, thus reducing outages.

Organizations need ITOA because the IT environment is complex and changes frequently. Analytics helps cut through this complexity by making issues more visible and speeding up the troubleshooting process.

ITOA solutions look at a constant stream of data about system health. They spot trouble signs and flag them for IT Ops teams, which include centralized IT operations teams, network operations center (NOC) teams, and increasingly, DevOps and SRE teams. Analytics helps these teams locate and diagnose problems. This streamlines a process that would otherwise be cumbersome to manage manually.

ITOA vs. AIOps

ITOA is evolving, and AIOps is its successor. Both analyze IT operations data. ITOA uses Big Data techniques. AIOps uses machine learning and artificial intelligence. This enables AIOps to improve on ITOA and be predictive and preventative.

AIOps stands for artificial intelligence for IT operations.

ITOA vs. System of Intelligence

ITOA and systems of intelligence both improve IT operations. But ITOA is functionally oriented. A system of intelligence takes a wider view. It looks at technology within the context of the whole organization and is aimed at strategic purposes.

A system of intelligence is one of three types of systems under a theory about how companies gain an advantage over competitors with their technology systems.

Systems of intelligence stand between systems of record, which include applications containing data on IT management along with systems for customer, enterprise, and employee data, and systems of engagement, which are communications channels such as a website and social media.

The system of intelligence integrates systems of record and applies machine learning and analytics to their data. Then the system of intelligence yields insights about the business.

In IT, the system of intelligence performs some functions similar to ITOA. But the goal of systems of intelligence is more transformative. The system uses business intelligence and predictive analytics to drive competitive advantage and innovation.

Benefits of IT Operations Analytics

IT Operations Analytics offers many benefits. They help companies keep their technology systems working efficiently. They enable IT Ops teams to maximize resource usage, improve user experience, and limit downtime.

Top ITOA benefits include:

  • Ability to gain a comprehensive view of all IT operations
  • Automated notifications for common problems
  • Better decision-making
  • Decreased downtime
  • Efficient resource usage
  • Faster troubleshooting and problem resolution
  • Identifying hotspots that generate the most alerts
  • Improved user experience and satisfaction
  • Optimization of system and application performance
  • Proactive identification of issues
  • Quick resolution of common issues
  • Reduced risk associated with system changes

Applications for IT Operations Analytics

IT Ops teams apply operations analytics in multiple ways. Some of these use cases aim to figure out the causes and solutions of IT problems. Other uses are focused on understanding how the system performs and how to improve that performance.

  • Assist Root Cause Analysis: ITOA helps IT teams determine the root cause of an issue. This may be hard to spot if the initial problem caused a cascade of effects or multiple issues occurred at once. Event correlation, which links problems to system changes, helps significantly. If there are multiple root causes, ITOA can rank them in priority order. This speeds resolution and aids prevention.
  • Find the Right Owner: Analytics helps identify the department, team, or person that is best placed to solve the problem. That shortens time to response and resolution compared to passing around the issue before reaching someone who can solve it.
  • Optimize System Performance: IT teams can leverage analytics to understand how varying conditions affect system uptime, service availability, and overall system performance. This understanding helps IT Ops anticipate how the system will act in the future.
  • Visualization: ITOA models and patterns of IT infrastructure and applications can supplement what other mapping and discovery tools show about system architecture, network topologies, and dependencies. This knowledge helps locate the site of an issue.
  • Understand Business Impact: Operations analytics can put issues within the context of the overall business. ITOA can highlight and prioritize problems that affect revenue generation. This may delay resolution of a less important issue that was reported earlier. Since time-to-resolution metrics are typically the benchmark for grading IT teams, this may require changing how teams are measured. But it aligns IT with the business.
  • Automate Action: Once you have visualization, root cause analysis, and other insights from ITOA, you can create automated response steps. For example, certain conditions, error codes, or events can trigger actions. These could include diagnostics and notifications, as well as putting a predefined run book into action.
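The event correlation mentioned under root cause analysis can be sketched in a few lines. This is a minimal illustration, not how any particular ITOA product works: the `Change` and `Alert` records, the `correlate` function, and the 30-minute window are all assumptions made for the example. It links each alert to recent changes on the same service and lists the most recent change first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    service: str
    description: str
    at: datetime


@dataclass(frozen=True)
class Alert:
    service: str
    message: str
    at: datetime


def correlate(alerts, changes, window=timedelta(minutes=30)):
    """Map each alert to changes on the same service made within `window` before it."""
    result = {}
    for alert in alerts:
        candidates = [
            c for c in changes
            if c.service == alert.service
            and timedelta(0) <= alert.at - c.at <= window
        ]
        # Most recent change first: a simple heuristic for ranking suspects.
        candidates.sort(key=lambda c: c.at, reverse=True)
        result[alert] = candidates
    return result


# Hypothetical sample data.
changes = [
    Change("checkout", "deploy v2.4.1", datetime(2024, 1, 5, 10, 0)),
    Change("checkout", "config update", datetime(2024, 1, 5, 10, 20)),
    Change("search", "index rebuild", datetime(2024, 1, 5, 10, 25)),
]
alerts = [Alert("checkout", "HTTP 500 rate high", datetime(2024, 1, 5, 10, 27))]

suspects = correlate(alerts, changes)
```

Here the checkout alert is linked to the two checkout changes, and the unrelated search change is ignored. Real systems typically also weigh topology and dependency data, which this sketch leaves out.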

How ITOA Applies Big Data Principles

ITOA works on Big Data principles. The purpose is to use your company’s data for better business outcomes. The key Big Data steps are gathering, storing, and organizing data. These efforts enable you to perform analytics and visualizations.

ITOA unifies data from:

  • Data logs from the network, hardware, applications, and other system information
  • Monitoring solutions
  • Software agents that observe and report on the IT environment and resource usage
  • Virtual machine monitor (VMM) software, also known as a hypervisor

The information flow from these tools is characterized by the three Vs of Big Data: velocity, volume, and variety. The various surveillance, monitoring, and reporting solutions produce data in large quantities, at high speed, in multiple formats, and from a variety of sources. The best practice is to use an analytics tool that brings together all your data sources and provides a unified view of your entire IT ecosystem.
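To make the "variety" point concrete, here is a rough sketch of how events from two monitoring tools might be normalized into one shape before analysis. Both tools and their payload fields (`hostname`, `sev`, `ts`, `node`, `level`, `when`, and so on) are hypothetical. Real tools each have their own formats.

```python
from datetime import datetime, timezone


def from_tool_a(payload):
    # Hypothetical tool A: epoch seconds and a numeric severity (1 = most severe).
    severity = {1: "critical", 2: "warning"}.get(payload["sev"], "info")
    return {
        "source": "tool_a",
        "host": payload["hostname"],
        "severity": severity,
        "time": datetime.fromtimestamp(payload["ts"], tz=timezone.utc),
        "message": payload["msg"],
    }


def from_tool_b(payload):
    # Hypothetical tool B: ISO 8601 timestamp and an upper-case text severity.
    return {
        "source": "tool_b",
        "host": payload["node"],
        "severity": payload["level"].lower(),
        "time": datetime.fromisoformat(payload["when"]),
        "message": payload["text"],
    }


events = [
    from_tool_a({"hostname": "web-1", "sev": 1, "ts": 1704448800, "msg": "disk full"}),
    from_tool_b({"node": "db-1", "level": "WARNING",
                 "when": "2024-01-05T10:00:00+00:00", "text": "slow queries"}),
]
```

Once every event has the same fields, it can be stored, queried, and correlated together regardless of which tool produced it.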

The Big Data technologies that enable users to perform analytics on this data include open-source software frameworks such as Hadoop for data lakes and NoSQL databases for unstructured data.

IT Operations Analytics mines these large data volumes and finds patterns and relationships in the data. These findings are the basis for algorithmic models that spot anomalies.

Working with data in this way represents a shift from the traditional approach in which IT Ops teams looked at data within each of their monitoring tools. Examining each piece in isolation leads to a fragmented view. One common pain point for teams was the need to toggle between screens to see each tool’s output.

Big Data makes it possible to bring data from all the monitoring and reporting tools together, both for more effective analysis and a simplified single-pane view for the user. IT teams gain a holistic picture of system performance. Doing this makes sense because the system’s components interact, and issues in one area affect another.

Some people describe this integration as data-driven IT as opposed to tool-driven IT because the data set as a whole directs IT actions, not the output of individual tools.

This evolution dovetails with trends toward integrated monitoring architecture, cross-functional teams, and continuous monitoring and improvement. In addition, continuous integration, continuous deployment, and continuous delivery of code updates increase the value of ITOA.

IT Operations Analytics Architecture

To maximize ITOA performance, the architecture needs scalability, interoperability, integration, security, and flexibility. ITOA systems built on open-source tools facilitate this architecture.

Features an ITOA architecture offers include:

  • Scalability: Can expand as systems and data volume grow without bottlenecks, usage restrictions, or cost barriers
  • Interoperability: Works with all operating systems and programming languages; is open and nonproprietary
  • Integration: Can integrate data in many ways, including APIs, middleware, and virtually; also provides uniform access and common storage methods
  • Security: Does not put the organization’s systems or data at risk
  • Flexibility: Integrates data of all types, from all tools, in one store

Many companies have built IT monitoring systems piecemeal, acquiring different tools for different needs such as network monitoring or applications support. This tends to result in an abundance, or even an excess, of tools. Each tool would produce helpful but siloed data. Robust ITOA demands integrating data from all sources with Big Data principles.

ITOA architecture provides complete visibility into the IT environment by working with data from all sources. These include:

  • Agent Data: Data from monitoring and surveillance agents, which can include agents that detect software coding errors
  • Human Data: Data resulting from human activity, including text, images, video, social media posts, and more; most ITOA systems can store this information, but IT Operations Analytics for this data type are immature.
  • Machine Data: Data reported by the system itself, such as audit logs and event tracing
  • Synthetic Data: Data created to test systems and services; this data emulates real data, including data that simulates customer transactions in different locations
  • Wire Data: Data from communications among system layers, from Layer 2 (data link) to Layer 7 (application)

The operations analytics system must be able to handle the following:

  • Complex Queries: These use multiple parameters and may require joins across multiple data tables and nested subqueries.
  • High Query Volume: The system is able to serve concurrent queries.
  • Live Sync: The database automatically and continuously updates with new data from all sources.
  • Low Data Latency: Updates to data are visible within a few seconds.
  • Low Query Latency: Results are returned in near real time.
  • Mixed Data: Data of different types are stored together, minimizing cleaning and reducing latency.

Four Types of IT Operations Analytics and When to Use Them

ITOA includes the four common types of analytics: descriptive, diagnostic, predictive, and prescriptive. These progress in complexity and difficulty. Descriptive analytics looks at data to describe what has happened. Prescriptive analytics answers the question, “What should we do next?”

As organizations increase their experience with ITOA, they become increasingly capable and ready for a more difficult level of analytics. In an analytics maturity model, prescriptive analytics requires the most maturity.

  • Descriptive IT Operations Analytics: This type of analytics provides information about what has happened in the IT environment. An example of this would be when the ITOA system detects customers having trouble checking out from the company’s e-commerce site. The IT team can swing into action and fix the problem before more sales are lost. Another example would be looking at historical data to calculate the IT Ops team’s mean time to resolve (MTTR), the average amount of time it takes to fix an issue.
  • Diagnostic IT Operations Analytics: This helps pinpoint the source and cause of the IT problem. For example, ITOA through root cause analysis can highlight an issue with the integration to the e-commerce site’s payment processor.
  • Predictive IT Operations Analytics: This tells you what is likely to happen. For example, based on historical data about past system crashes, ITOA can identify the system state, usage patterns, and other factors that are likely to cause a system outage in the future.
  • Prescriptive IT Operations Analytics: This supports better decision-making by telling you which actions will produce the best outcomes. It uses simulation and optimization algorithms. This area of ITOA is the least mature. Decision support from prescriptive analytics improves as ITOA becomes more proficient at working with data ambiguity. For example, analytics can tell you that the company is better off building a new data center now based on usage patterns, network traffic, geographic distribution of sales, growth trends, and the relative costs and maintenance needs of adding capacity to existing data centers versus building new ones.
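As a small descriptive-analytics example, the MTTR calculation mentioned above can be sketched as follows. The incident tuples and the `mean_time_to_resolve` name are illustrative assumptions. A real report would pull open and resolve times from your incident management system.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (resolved - opened) across resolved incidents."""
    durations = [
        resolved - opened
        for opened, resolved in incidents
        if resolved is not None  # skip incidents that are still open
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents as (opened, resolved) pairs.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),    # 30 minutes
    (datetime(2024, 1, 5, 11, 0), datetime(2024, 1, 5, 12, 30)),  # 90 minutes
    (datetime(2024, 1, 6, 8, 0), None),                           # still open
]

mttr = mean_time_to_resolve(incidents)
print(mttr)  # 1:00:00
```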

How Analytics Can Improve IT Operations and Services

IT operations is a metrics-driven function, and teams should keep score as a core practice. Services and sub-services break, alerts of varying quality come in, incidents are created, and services get fixed. Analytics can help IT teams improve these operations.

Through the entire incident management pipeline, key performance indicators (KPIs) can help organizations find gaps in their process, increase efficiency, and measure the performance of their people, systems, and tools.

Service downtime and its opposites, service availability and reliability, are the most critical measures and require constant monitoring and improvement.

Bear in mind these pointers:

  • The quantity and quality of event and alert streams vary.
  • The signal-to-noise ratio helps define how good your primary process input is. As you implement improvements, measure the changes.
  • MTTx metrics are useful. Pivot them by team, service, source, or other attribute to rapidly identify gaps.
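The pointers above can be turned into a quick calculation. The sketch below pivots a resolution-time metric by any attribute and computes one possible signal-to-noise ratio (actionable alerts divided by non-actionable ones). The record fields and function names are assumptions for illustration, and your team may define signal and noise differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; minutes_to_resolve is assumed to be precomputed.
incidents = [
    {"team": "network", "service": "vpn", "minutes_to_resolve": 40},
    {"team": "network", "service": "dns", "minutes_to_resolve": 20},
    {"team": "payments", "service": "checkout", "minutes_to_resolve": 90},
]


def pivot_mttr(records, attribute):
    """Mean minutes to resolve, grouped by the given attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record["minutes_to_resolve"])
    return {key: mean(values) for key, values in groups.items()}


def signal_to_noise(actionable_alerts, total_alerts):
    """Actionable alerts divided by non-actionable alerts."""
    noise = total_alerts - actionable_alerts
    return actionable_alerts / noise if noise else float("inf")


by_team = pivot_mttr(incidents, "team")
by_service = pivot_mttr(incidents, "service")
ratio = signal_to_noise(actionable_alerts=50, total_alerts=250)
```

Pivoting the same data by `team` and then by `service` shows quickly whether a slow average comes from one team or one service.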

Examples of IT Operations Analytics Reports and When to Use Them

Operational analytics reports and dashboards give insights into key trends about IT operations management. Some of the most-watched items are how engineering teams and IT systems are performing. Here are a few examples of typical IT Ops reports used by IT Ops managers and executives:

Team Performance: This report shows incidents assigned to each engineer, the percentage resolved, whether the engineer resolved or escalated the issue, and more. This helps track workload balancing and team efficiency, as well as drive accountability.

[Figure: Team Performance Chart]

Hotspots: The report helps identify services that are creating the most noise. You can use this report in combination with other data to determine if certain systems are providing useful event data or simply creating alert fatigue.

[Figure: Hotspot Chart]
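A hotspot report like the one described above boils down to counting alerts per service and ranking the results. Here is a minimal sketch. The alert records and the `hotspots` function are hypothetical, and a real report would likely add time ranges and filters.

```python
from collections import Counter

# Hypothetical alert stream.
alerts = [
    {"service": "checkout", "check": "latency"},
    {"service": "checkout", "check": "latency"},
    {"service": "checkout", "check": "errors"},
    {"service": "search", "check": "cpu"},
    {"service": "search", "check": "cpu"},
    {"service": "auth", "check": "disk"},
]


def hotspots(alerts, top_n=3):
    """Return (service, alert_count, share_of_total) for the noisiest services."""
    counts = Counter(alert["service"] for alert in alerts)
    total = sum(counts.values())
    return [
        (service, count, count / total)
        for service, count in counts.most_common(top_n)
    ]


top_two = hotspots(alerts, top_n=2)
```

In this sample, checkout produces half of all alerts, so it is the first place to check for noisy or misconfigured monitors.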

Mean Time Between Failures: This shows the average time between failures. For example, this can track which systems or applications fail most often. This lets you know where to focus improvement efforts.

[Figure: Mean Time Between Failures]
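For reference, a simplified mean time between failures calculation might look like the sketch below. It averages the gaps between consecutive failure timestamps. That is an assumption made to keep the example short: stricter definitions count only operating time and exclude time spent on repair.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times):
    """Average gap between consecutive failures; None if fewer than two."""
    ordered = sorted(failure_times)
    if len(ordered) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical failure timestamps for one system.
failures = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 3),  # 2 days after the first failure
    datetime(2024, 1, 9),  # 6 days after the second failure
]

mtbf = mean_time_between_failures(failures)
```

With gaps of two and six days, the result is four days. Running the same calculation per system shows which ones fail most often.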

IT Operations Analytics Use Cases

ITOA’s most important role is to drive better business performance. This results from IT systems that are more reliable and efficient. Use cases demonstrate how IT analytics can impact customers and the business.

With the right solution, IT operations managers can view the status of all monitoring and surveillance systems from one screen. This adds clarity and efficiency.

For example, a video gaming studio has many players online around the world simultaneously. The volume of alerts can easily be overwhelming. But ITOA can consolidate repeat instances of the same problem into one issue, a process known as compression. Then analytics correlates these issues with system changes and health conditions to pinpoint causes.

When the studio introduced a new online multiplayer game, the launch triggered 3,000 alerts. But analytics compressed those by 99 percent, resulting in only 35 tickets. That made the IT Ops team’s job manageable, and it improved the experience for customers, resulting in a win for the business. See how Bungie used ITOA for the win.
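To illustrate compression, here is a minimal sketch that groups repeat alerts into incidents and computes the compression rate. Grouping by host and check is an assumption chosen for simplicity. Production systems use richer correlation logic, and the names below are hypothetical.

```python
from collections import defaultdict


def compress(alerts):
    """Group alerts that share a host and check into a single incident."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[(alert["host"], alert["check"])].append(alert)
    return incidents


def compression_rate(alert_count, incident_count):
    """Fraction of alerts eliminated by grouping."""
    return 1 - incident_count / alert_count


# Hypothetical alerts: three repeats of one problem plus one distinct problem.
alerts = [
    {"host": "game-01", "check": "cpu"},
    {"host": "game-01", "check": "cpu"},
    {"host": "game-01", "check": "cpu"},
    {"host": "game-02", "check": "login"},
]

incidents = compress(alerts)
sample_rate = compression_rate(len(alerts), len(incidents))

# The figures from the game launch: 3,000 alerts reduced to 35 tickets.
launch_rate = compression_rate(3000, 35)
```

For the launch figures, the rate works out to roughly 98.8 percent, which rounds to the 99 percent cited.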

In an IT Operations Analytics use case in the entertainment industry, TiVo was dealing with a rapid pace of change and innovation and found it challenging to proactively detect and resolve issues before customers were affected. With ITOA, the company was able within four weeks to correlate 80 percent of alerts into actionable incidents, speeding resolution.

Predictive Analytics in IT Operations

Predictive analytics has various uses in IT operations. These findings anticipate what will happen in your IT environment so you can take action. For example, predictive analytics can identify the best corrective steps to solve recurring issues.

If analytics forecasts outages, the IT team can act proactively. They can perform maintenance or bring backup systems online to prevent a disruption. Predictive analytics can enable teams to automate responses to common incidents.
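As a very simple illustration of forecasting, the sketch below fits a straight line to a resource metric and estimates how many days remain before it hits a limit. Disk usage is a hypothetical example metric, and a linear trend is an assumption. Real predictive analytics usually relies on more sophisticated models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_pct, limit=100.0):
    """Days after the last sample until the fitted trend reaches `limit`."""
    slope, intercept = fit_line(days, used_pct)
    if slope <= 0:
        return None  # usage is flat or shrinking, so no forecast
    return (limit - intercept) / slope - days[-1]


# Hypothetical daily samples of disk usage, in percent.
days = [0, 1, 2, 3, 4]
used = [60.0, 62.0, 64.0, 66.0, 68.0]

remaining = days_until_full(days, used)
```

With usage growing two points per day from 68 percent, the estimate is 16 days, which gives the team time to add capacity before an outage.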

Machine Learning for IT Operations Analytics

Machine learning powers predictive analytics. These ITOA algorithms are trained to learn normal and abnormal conditions. They include context such as time of day, season, business conditions, and other variables. Machine learning’s strengths include the ability to work with all kinds of data.

This allows AIOps to work with structured and unstructured information, such as the output of various monitoring, topology, logging, and other tools. Despite the plethora of data, analytics can filter out irrelevant alerts and noise. Then ITOA flags meaningful anomalies. This enables teams to catch issues before users are affected.
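The idea of learning what is normal in context can be shown with a deliberately simple stand-in for a trained model: a per-hour statistical baseline. This is not a machine learning algorithm from any product. The function names, the three-standard-deviation threshold, and the sample data are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_anomalous(baseline, hour, value, threshold=3.0):
    """Flag values more than `threshold` standard deviations from that hour's mean."""
    if hour not in baseline:
        return False  # no context learned for this hour yet
    avg, spread = baseline[hour]
    if spread == 0:
        return value != avg
    return abs(value - avg) / spread > threshold


# Hypothetical requests per minute observed at 3 a.m. and 2 p.m. on past days.
history = [
    (3, 100), (3, 110), (3, 90), (3, 100),
    (14, 1000), (14, 1100), (14, 900), (14, 1000),
]
baseline = build_baseline(history)

night_spike = is_anomalous(baseline, 3, 1050)    # True: far above the 3 a.m. norm
afternoon_load = is_anomalous(baseline, 14, 1050)  # False: typical for 2 p.m.
```

The same reading of 1,050 requests per minute is flagged at 3 a.m. but not at 2 p.m., which is the kind of context that keeps routine daytime load from generating noise.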

But machine learning is not without challenges. Depending on whether the machine learning is a version of explainable AI or “black box AI”, IT teams can still encounter false positives and notification fatigue.

Also, refining and advancing ML-driven analytics require data science expertise. The primacy of data scientists in building many systems makes the analytics process very opaque. This “black box” quality causes distrust and skepticism among some user groups. IT engineers want more transparency and control.

Leverage ITOA for Business Benefits with Unified, Purpose-Built Analytics

ITOA leaders can achieve faster incident resolution and prevent outages by leveraging unified analytics that is purpose-built for IT operations. Purpose-built IT operations analytics tools are not general-purpose reporting or BI tools that have been adapted for IT operations. Instead, they offer out-of-the-box IT Ops KPIs, widgets, and dashboards, and are designed for different IT operations personas such as NOC managers and directors, VPs of IT operations, and application and service owners.

Learn how BigPanda’s Unified Analytics for IT Operations can help solve the problems of siloed reporting, weak or general-purpose dashboards, and not being able to use IT Ops data to foster better decisions.