Managing IT Operations in Crisis

As entire industries transition to remote work and businesses face an existential threat from a global pandemic, the performance of IT Operations teams is more important than ever. Operations personnel need to monitor, triage, communicate, and manage incidents throughout this transition 24/7, across all services. Corporate VPNs, SaaS applications, and legacy on-premises and homegrown tools and systems are all stretching to meet new demands from the business, and some will encounter problems or fail outright.

Here are some ways CIOs and their IT Operations teams can adapt to the immediate challenges this crisis is creating, and evolve their organizations to survive in the short term and be well positioned in the long term:

People matter most

Consider the changes your teams are going through, and the pressure this puts on people, including your Operations Center, SREs, DevOps, other engineers, and all the teams they support throughout the business. Taking care of your people through a crisis is paramount. As the situation continually evolves, leaders need to rise above the noise and focus their teams on their missions. This is one part “keep calm and carry on” and one part “here’s what you can do right now to improve our situation.” Leading teams by helping them remain focused and look beyond the horizon is essential.

Build and maintain the big picture

IT Operations teams often have the best, and least-leveraged, situational awareness in a company. As they see alerts or people reporting problems in real time, IT Ops teams gather and use that information to assess, prioritize, diagnose, and resolve incidents across networks, infrastructure, applications, and services. Their bird’s-eye view of the business is critical to leadership and individual contributors alike; understanding the health of all of your employee-facing and customer-facing services gives you the visibility you need to respond with agility to a changing business landscape.

Too often, this visibility remains siloed within an IT Operations team, and only when issues become critical are parts of that awareness pulled or pushed throughout the organization. IT Ops should proactively promote visibility into service performance, team performance, new trends in usage or incidents, and anything else they can see from their privileged position that the rest of the organization cannot. Even reporting “situation normal, all systems up” is useful to a remote workforce. From high-level periodic reports, through service-status dashboards, to the very tactical live incident status page, the objective is to create a shared awareness across teams, so they can make decisions and focus effort in a way that’s aligned to a common ground truth.

Maintain IT change awareness

All services require some level of routine hygiene. Change velocity has ramped up significantly over the years, and managing IT change effectively can be more difficult across distributed teams. In today’s environment, some teams will slow down as they switch to working from home, but some will have to speed up to respond to business demands. Regardless, your application updates, database maintenance, server OS updates, security fixes, and network configuration changes are still needed, and might be more difficult than usual to implement without disruption. IT Operations teams know from experience that changes always present some level of risk, so most teams track them in order to correlate them to service impacts. With the individuals performing these changes working remotely, there’s less collective awareness. A centralized change process and change information hub (including the critical what, when, why, and who) can help people deconflict changes from one another and minimize risk to the business. For IT Operations, it also provides rapid correlation of changes to impacts, reducing recovery times when a change goes sideways.
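To make the idea of a change information hub concrete, here is a minimal sketch in Python of how change records could be correlated to an incident. Everything in it is an assumption for illustration: the `ChangeRecord` fields mirror the “what, when, why, and who” above, while the `service` field, the function name `changes_near_incident`, the two-hour lookback, and the sample data are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """One entry in a (hypothetical) centralized change log."""
    what: str        # description of the change
    when: datetime   # when the change was applied
    why: str         # business or technical reason
    who: str         # person or team that made the change
    service: str     # affected service (assumed field, for correlation)


def changes_near_incident(changes, service, incident_start,
                          lookback=timedelta(hours=2)):
    """Return changes to `service` made within `lookback` before the
    incident started, most recent first."""
    window_start = incident_start - lookback
    candidates = [
        c for c in changes
        if c.service == service and window_start <= c.when <= incident_start
    ]
    return sorted(candidates, key=lambda c: c.when, reverse=True)


# Illustrative data only; names and times are invented.
change_log = [
    ChangeRecord("database maintenance", datetime(2020, 4, 1, 9, 30),
                 "routine hygiene", "alice", "vpn"),
    ChangeRecord("network config push", datetime(2020, 4, 1, 10, 40),
                 "capacity increase", "bob", "vpn"),
    ChangeRecord("server OS patch", datetime(2020, 4, 1, 10, 50),
                 "security fix", "carol", "email"),
]

suspects = changes_near_incident(
    change_log, "vpn", incident_start=datetime(2020, 4, 1, 11, 0)
)
for change in suspects:
    print(change.when, change.who, change.what)
```

Listing the most recent change first reflects a common triage habit of checking the latest change before older ones; a real hub would likely also match on dependencies between services, which this sketch ignores.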

Proactively measure, analyze, and report

Every IT Operations team has a multitude of metrics and KPIs used to report on their own processes and service performance. From DAU to MTTx to Service Availability, these numbers usually stay within a predictable range and people don’t necessarily track them regularly.

However, due to current changes in the internal work environment and external business conditions, these numbers might be on the move. Organizations that don’t already track such metrics should try to implement them, looking for outliers and trends that help define and quantify how service usage and team performance are changing. CIOs and their IT Operations teams will see what’s moving in the wrong direction, and can highlight problem areas for C-levels as well as for the teams who need to engage and adapt to the current situation. An organization’s OODA loop can be much faster if the “observe” and “orient” phases are tied to well-defined metrics that are consistently used to understand business performance in real time.
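As one illustration of looking for outliers, the sketch below compares recent values of a metric against its historical baseline and flags values that sit far outside the usual range. It is a simple z-score check using only the Python standard library, not a recommendation of a specific method; the function name `flag_outliers`, the threshold of 3 standard deviations, and the sample resolution times are all assumptions made for the example.

```python
from statistics import mean, stdev


def flag_outliers(history, recent, threshold=3.0):
    """Return (index, value) pairs from `recent` that deviate from the
    mean of `history` by more than `threshold` standard deviations."""
    baseline_mean = mean(history)
    baseline_sd = stdev(history)  # requires at least two data points
    if baseline_sd == 0:
        # A perfectly flat baseline: any different value is an outlier.
        return [(i, v) for i, v in enumerate(recent) if v != baseline_mean]
    return [
        (i, v) for i, v in enumerate(recent)
        if abs(v - baseline_mean) / baseline_sd > threshold
    ]


# Invented sample: daily mean time to resolve, in minutes.
baseline_mttr = [30, 32, 29, 31, 30, 33, 28]
this_week_mttr = [31, 45, 52]

outliers = flag_outliers(baseline_mttr, this_week_mttr)
print(outliers)
```

A fixed baseline like this is easy to reason about but goes stale quickly when conditions shift for good; a rolling window is a common alternative once the “new normal” has settled.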

Burn down the backlog

Some businesses and teams will experience a slowdown, and resources will become available. SREs, DevOps, and other engineers can be given achievable goals by using this opportunity to take care of those long-standing but low-priority projects, like filling monitoring gaps, completing “mostly done” implementations, taking care of noise in event streams, and generally paying down tech debt. Companies can focus managers and non-engineering ICs on streamlining workflows, improving processes, and providing effective reporting. Every IT department has a list of issues that are perennially backlogged in both engineering and non-engineering fields, and high-maintenance legacy or homegrown systems that they want to replace but haven’t been able to. Now is the time to focus teams on what they can do and give them a legitimate sense of purpose in the here and now, so that the business is that much more agile when the economy gets back to full speed.

Reduce the tool sprawl

Solution sprawl is a growing challenge, and many CIOs struggle to support multiple solutions in the same space. Chat, project management, CI/CD, orchestration, data visualization, monitoring solutions, ticketing systems, even ERPs and public clouds: many organizations are using multiple examples of each on different teams, and IT Ops can barely track them. Supporting them requires more resources, and can make maintaining a high-availability, secure network environment a challenge. Remote work at scale exacerbates these issues, fragments the ability to get a big picture, and reduces overall organizational unity.

Balance these costs against the value each unique solution brings, and compromise when there is a real differentiator. Now is a good time to drive teams to the best-in-class, cloud-based SaaS solutions that natively support work from home. A time of crisis may also mean better pricing, and consolidation can possibly reduce OPEX. Examine your teams’ workflows, and identify places where multiple tools create large delays due to mental switching costs; streamline through integration and/or aggregating related data into a single pane of glass. IT Ops teams will have less to manage and better tools to do it with. The need for organizational agility is the best argument against sprawl, and that need is visible across nearly all industries right now.

And lastly, invest in the future of IT Ops

Rarely an organizational focus, IT operations is as critical now as it’s ever been. The IT operations mission is to provide consistent, reliable, 24/7 monitoring and incident management across all services, in order to keep the business running. It provides actionable information to your teams while filtering out the noise, and it faces a never-ending battle. Every single day, IT Ops teams deal with all the problems, large and small, that can occur within an ever-evolving service technology stack, and try to prevent or limit the impacts to users and customers. Now they are doing it from home, with the as-yet-poorly-defined threat of a global pandemic hanging over everything. This crisis might be the catalyst needed to trigger an investment in IT Ops’ people, processes, and technology that should have been made years ago, and to provide the necessary agility and scalability for the future. IT Ops needs CIO-level support in the form of organizational focus and executive sponsorship so that it can evolve to keep pace with the business. And sometimes, teams just need a pat on the back for getting the job done.

About the Author: Jason Walker

Field CTO at BigPanda. Former Director of IT Operations at Blizzard Entertainment, where he led the company’s transition to support multi-product, always-online game services. He spent the last six years running IT Operations across Activision-Blizzard-King, building their GNOC into a world-class OpsCenter.