How Expedia modernized operations on one of the world’s fastest-moving IT stacks
It’s not every day that we get a first-hand look at how one of today’s leading and most advanced enterprises operates its IT stack. That’s why we were very excited when three senior IT executives from Expedia accepted our invitation to participate in a webinar discussing the company’s IT modernization journey. Bill Hancock (Expedia’s Director of Site Reliability Operations), Donata Wonsowicz (Senior Director of Reliability Platform) and John Chao (VP of Technology) sat down with us to share key insights from Expedia’s AIOps journey, and what an insightful discussion it was.
Ultra complex, ultra fast
The Expedia Group operates over 200 booking sites, which serve over 3.7M hotels and vacation rentals and book around 400M rooms per year. And we haven’t even mentioned the 500+ airlines they collaborate with and the thousands of smaller travel agencies around the world that use Expedia’s platform and APIs. To support this, the team has built a complex, fast-moving tech stack.
Expedia’s stack is hybrid, running both on-prem and in the cloud; it operates on all runtime environments, including serverless; it runs under many different operating models, with both centralized and decentralized DevOps, SRE and Operations teams; and, like most other large enterprises, it suffers from tool sprawl, with over 50 different monitoring tools! And how fast is this complex stack? Expedia engineers push thousands of deployments every single day via CI/CD pipelines.
Such complexity and speed create many challenges, which Expedia quickly identified. As Donata Wonsowicz explained, these include:
- Allowing “local” DevOps autonomy while maintaining enterprise-level continuity.
- Lowering production risk from high-velocity deployments.
- Lowering noise levels from the many monitoring tools to ensure shorter incident resolution times.
- Aggregating all the events collected from across Expedia to create holistic, end-to-end visibility across their hybrid environments and multiple tools.
- Standardizing key operational metrics for measuring success and driving impact.
BigPanda AIOps is here to help
Expedia understood that the solution to all these challenges lay in making sense of the information coming in from all of their monitoring and change tools. As Bill Hancock succinctly explained: “…We needed to understand how things related… we had multiple cockpits that people had to watch and to constantly switch between, to create context. We discovered that the actual correlation is extremely hard, if not impossible for a person to do”.
They selected BigPanda to help. Bill explained: “What BigPanda has helped us do is take all the sources that we have put through, and provide correlation… we can see events happen from a single cockpit. BigPanda discovers a lot of things that the brain just can’t keep up with, certainly in a large team that spans the globe… we like to use the analogy of an iceberg: you might see the tip, but it’s what’s underneath that is huge, and it’ll sink your boat. BigPanda has shown us in some cases that not only do you not see the whole iceberg, but sometimes even the tip is transparent”.
A few big takeaways
Adopting BigPanda’s Event Correlation and Automation platform, powered by AIOps, has helped Expedia in its modernization journey, with quantifiable results:
- Enabling DevOps. BigPanda has allowed Expedia to reinvest the developer and engineer time saved by reducing MTTR into feature development.
- Better incident response. Expedia has managed to shorten the “support chain” by bundling all the relevant information about an incident and sending it to the right person to fix the issue.
- Operational efficiency at scale. With BigPanda, Expedia IT Ops has been able to remain lean and even assign resources to focus on projects that can drive revenue.
- MTTR improvement. MTTR, the metric that most affects reputation and revenue and a main concern for any business, has been significantly reduced!
Recommendations and best practices
Among the most interesting takeaways from the webinar were the hands-on recommendations from John, Bill and Donata regarding best practices, from both the operational and the organizational perspectives. No less valuable were their candid and detailed answers to questions from the participants about processes and tools. For anyone wanting to learn more about how to successfully cross the chasm of IT modernization, and the role AI and automation can play there, their insights are pure gold.