As a former NOC engineer, I clearly remember my onboarding, and especially the deep-rooted fear I felt every time I encountered an alert that was new to me – particularly during a night shift. My only consolation was that I was never alone during training, so there was always someone I could ask that very awkward question: “I’m new here, what do we do with this…?”
Where Is…? Who Should…? What The…?
The people I used to go to for advice embodied a well-known term in any organization: “tribal knowledge”. A company’s tribal knowledge is defined by Wikipedia (which quotes Six Sigma) as “Any unwritten information that is not commonly known by others within a company. This term is used most when referencing information that may need to be known by others in order to produce quality products or services”. Basically, the term refers to those people (or even worse – that one person) we all know who have in-depth knowledge about a specific subject because they have been working on it for years. It also carries the acknowledgment that, without them, we are all doomed.
In IT, and specifically in the NOC, tribal knowledge is critical to the business; it can often make all the difference between an alert that is taken care of in a timely manner and a full-blown, revenue-shredding outage. That’s why, when I first started my training as a NOC engineer, I was extremely stressed out by the fact that in a few weeks, I would have to know and support every process, server and website that the company operated. And even worse – that if any problem should present itself, I would have to resolve it myself. I quickly realized that, although there was sufficient knowledge in the company to deal with nearly every type of incident, part of this knowledge was scattered across several different knowledge bases and run books with no consistent methodology or logic, and the rest was stored in the minds of senior NOC engineers.
So, when new alerts came in, I had to investigate them to the best of my ability using different monitoring tools, knowledge bases, run books and logs, while also trying to correlate them to other alerts coming in simultaneously.
This process was always time-consuming, and in some cases nerve-racking, as service disruptions were piling up while I was trying to make sense of what was going on. It certainly wasn’t helpful that my monitoring systems were not providing me with all the information I needed to solve each incident. Easy access to information about impacted customers, affected systems, the owners of the application that was down, and maybe even a link to a relevant run book or documentation would have done wonders for my ability to solve incidents faster.
Today, as a product manager at BigPanda, a key part of what I do is to help customers avoid the pitfalls I encountered.
Turning Tribal Knowledge from a Challenge to an Advantage
There are several key guidelines that help organizations identify, collect and adopt tribal knowledge in their operations. Some are procedural, others relate to the tools they work with.
- Identifying a tribal knowledge ‘designated driver’: one person whose job it is to make sure that all undocumented (or badly documented) knowledge is identified, collected and documented in a consistent, accessible manner across the whole organization. This person doesn’t necessarily need to be the one actually doing the documenting, but should rather be responsible for the documentation process itself, delegating different subjects to the relevant subject matter experts.
- Introducing an event management system that can ingest this data and display it alongside its related incident as needed. By providing users with the operational processes and business data related to the incidents they are dealing with, such a system can help them prioritize, analyze and resolve those incidents in a timely manner. BigPanda’s Open Integration Hub does just this, allowing out-of-the-box integrations with most enterprise data sources that house tribal knowledge – from simple Excel sheets, to complex commercial systems.
- Automating operational processes that implement tribal knowledge, essentially making it available to everyone, and thereby enabling easier and faster incident resolution. BigPanda provides many levels of user-defined automation in the detection, classification, correlation and resolution of incidents – all of which can be informed by the ingested tribal knowledge/data described above. From automatically routing an incident to a relevant team, to automating ticket creation, to correlating alerts into a single incident using BigPanda’s Open Box Machine Learning – BigPanda utilizes tribal knowledge to help detect problems, identify their root cause and resolve them faster and more easily.
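To make the last two guidelines a little more concrete, here is a minimal sketch of what enriching alerts with documented tribal knowledge, and then routing and grouping them, might look like. It is not BigPanda's API or implementation; the knowledge table, the host names, the run book paths and the functions `enrich`, `route` and `group_by_application` are all hypothetical, and the grouping is a plain tag match rather than any form of machine learning.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical knowledge table keyed by host name, e.g. exported from a
# spreadsheet kept by the subject matter experts. All values are made up.
KNOWLEDGE = {
    "billing-db-01": {
        "application": "billing",
        "owner_team": "database",
        "runbook": "runbooks/billing-db.md",
    },
    "billing-web-01": {
        "application": "billing",
        "owner_team": "web",
        "runbook": "runbooks/billing-web.md",
    },
}


@dataclass
class Alert:
    host: str
    check: str
    tags: dict = field(default_factory=dict)


def enrich(alert, knowledge=KNOWLEDGE):
    """Return a copy of the alert with any documented knowledge added as tags."""
    extra = knowledge.get(alert.host, {})
    return Alert(alert.host, alert.check, {**alert.tags, **extra})


def route(alert, default_team="noc"):
    """Pick the owning team from the tags, falling back to the NOC."""
    return alert.tags.get("owner_team", default_team)


def group_by_application(alerts):
    """Group alerts that share an application tag into one candidate incident."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.tags.get("application", "unknown")].append(alert)
    return dict(groups)


if __name__ == "__main__":
    raw = [
        Alert("billing-db-01", "disk_full"),
        Alert("billing-web-01", "http_5xx"),
        Alert("mystery-host", "ping"),
    ]
    enriched = [enrich(a) for a in raw]
    for a in enriched:
        print(a.host, "->", route(a), a.tags.get("runbook", "no run book documented"))
```

An alert for a host nobody has documented yet falls back to the NOC with no run book attached, which is exactly the situation the ‘designated driver’ is there to eliminate.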
Bottom line? Tribal knowledge in IT Ops doesn’t have to be a challenge. If handled properly, it can actually become an asset.