Latency is defined as the time between a stimulus and the response to it. In networking, for example, latency is the time it takes a packet to travel from one endpoint to another. BigPanda delivers aggregated alerts from the customer’s monitoring solution to the BigPanda web UI. To make sure that we display those alerts as quickly as possible, we measure the time it takes for an alert to be displayed, from the moment it arrives at our backend services until it is published to the UI. If this metric crosses a predefined threshold, we trigger an alert. By monitoring this metric we kill two birds with one stone: we keep the user experience great, because incidents appear on screen almost immediately, and we gain insight into performance issues and overall system health.
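To make the threshold check concrete, here is a minimal sketch in Python. It is not our production code: the function names, the timestamps, and the choice of a nearest-rank 95th percentile are all assumptions made for illustration. The only part taken from the description above is the idea of measuring from arrival to publication and alerting when the result crosses a predefined threshold.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_breached(arrived_at, published_at, threshold_seconds, pct=95):
    """Return True if the pct-th percentile latency exceeds the threshold.

    arrived_at / published_at are parallel lists of timestamps in seconds
    (hypothetical inputs; a real pipeline would read them from its own events).
    """
    latencies = [done - start for start, done in zip(arrived_at, published_at)]
    return percentile(latencies, pct) > threshold_seconds


# Four alerts: three are published quickly, one takes 8 seconds.
arrived_at = [0.0, 0.0, 0.0, 0.0]
published_at = [0.5, 1.0, 1.5, 8.0]

slow_tail = latency_breached(arrived_at, published_at, threshold_seconds=2.0)
typical = latency_breached(arrived_at, published_at, threshold_seconds=2.0, pct=50)
```

Checking a high percentile rather than the average is a design choice: a single slow alert is exactly what a user notices, and an average would hide it.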
Error rate is defined as the percentage of errors out of the total number of transactions during some period of time. This metric plays a significant role in understanding how our system behaves, especially during peak times or after application deployments. Measuring error rates can provide an early indication of system instability. Imagine deploying something to production without measuring the error rate: how would you know if you had a serious problem? In the best case scenario, users will open a ticket. In most cases, though, days will pass before you know you have a problem, and at that point you’re light years away from the root cause. System instability might cause your customers to perceive the service as having poor quality, which is certainly something you want to avoid.
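The definition above translates almost directly into code. The sketch below keeps a sliding window of transactions and reports the percentage that failed. The class name, the window length, and the assumption that events are recorded in timestamp order are ours, chosen only for illustration.

```python
from collections import deque


class ErrorRateWindow:
    """Percentage of failed transactions over the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp, is_error):
        # Assumes timestamps are recorded in non-decreasing order.
        self.events.append((timestamp, is_error))

    def rate(self, now):
        # Drop transactions that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        if not self.events:
            return 0.0
        errors = sum(1 for _, is_error in self.events if is_error)
        return 100.0 * errors / len(self.events)


window = ErrorRateWindow(window_seconds=60)
window.record(0, False)
window.record(10, True)
window.record(20, False)
window.record(30, False)

rate_at_30 = window.rate(now=30)    # 1 error out of 4 transactions
rate_at_65 = window.rate(now=65)    # the transaction at t=0 has expired
rate_at_100 = window.rate(now=100)  # every transaction has expired
```

Comparing the rate just before and just after a deployment is one simple way to get the early indication described above.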
Problems Counter for External Dependencies
The two metrics we discussed so far are technical in nature. That’s not necessarily bad, but sometimes business KPIs are a better indication of the health and quality of your service. Business KPIs are usually coupled to the semantics of your offering, and collecting them is crucial for understanding the behavior of the major features in your product. BigPanda integrates with other monitoring systems, so naturally it is very important for us that integrations go smoothly. Imagine a scenario where a certain monitoring system changes its data format unexpectedly: it’s our job to detect this as quickly as possible and adapt to the change. For this purpose we count integration errors and trigger an alert when a certain integration fails for enough customers. This way we ensure that our commitment to reliable and stable integrations is truly upheld.
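One way to read "fails for enough customers" is to count distinct customers per integration, so that a single noisy customer cannot trigger the alert alone. The sketch below takes that reading. The class, the integration names, the customer IDs, and the threshold of three are all hypothetical.

```python
from collections import defaultdict


class IntegrationErrorCounter:
    """Tracks which customers have seen errors from each integration."""

    def __init__(self, min_customers):
        self.min_customers = min_customers
        self.failing = defaultdict(set)  # integration name -> customer ids

    def record_error(self, integration, customer_id):
        # A set means repeated errors from one customer count only once.
        self.failing[integration].add(customer_id)

    def integrations_to_alert(self):
        return sorted(
            name
            for name, customers in self.failing.items()
            if len(customers) >= self.min_customers
        )


counter = IntegrationErrorCounter(min_customers=3)
for customer in ["acme", "globex", "initech", "acme"]:
    counter.record_error("integration_a", customer)
counter.record_error("integration_b", "acme")

alerts = counter.integrations_to_alert()
```

Here `integration_a` has failed for three different customers and is flagged, while `integration_b` has failed for only one and is not. That pattern is what you would expect to see when a monitoring system changes its data format for everyone at once.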
UI Speed and Responsiveness
You build a kickass application, with a robust backend that works like “magic”, but your users are still having a bad experience every time they interact with your system: “It’s just too damn slow!” If you’re not measuring your end users’ experience, you have no idea how the product actually feels to them, and eventually you’ll lose customers without even knowing why. By measuring page load time, render time and so on, you can ensure that your users’ experience is top notch.
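As a rough sketch of what measuring page load time can look like on the collecting side, assume the frontend reports a few timestamps per page view. The field names below are borrowed from the browser's Navigation Timing API (`navigationStart`, `domContentLoadedEventEnd`, `loadEventEnd`). The beacon shape, the function names, and the 3000 ms budget are assumptions for illustration.

```python
def page_timings(beacon):
    """Derive DOM-ready and full-load durations (ms) from one timing beacon."""
    start = beacon["navigationStart"]
    return {
        "dom_ready_ms": beacon["domContentLoadedEventEnd"] - start,
        "page_load_ms": beacon["loadEventEnd"] - start,
    }


def slow_page_share(beacons, budget_ms):
    """Fraction of page views whose full load exceeded the budget."""
    if not beacons:
        return 0.0
    slow = sum(1 for b in beacons if page_timings(b)["page_load_ms"] > budget_ms)
    return slow / len(beacons)


# Two hypothetical page views, timestamps in milliseconds.
beacons = [
    {"navigationStart": 1000, "domContentLoadedEventEnd": 1800, "loadEventEnd": 2500},
    {"navigationStart": 5000, "domContentLoadedEventEnd": 7000, "loadEventEnd": 9500},
]

first_view = page_timings(beacons[0])
share_over_budget = slow_page_share(beacons, budget_ms=3000)
```

Tracking the share of slow page views, rather than a single average, tells you how many users are actually having the “too damn slow” experience.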
Metric collection is crucial if you want to understand your system and product better. The metrics above might be perceived as less urgent, but they sure are important. It makes sense that you would want to know whether your service is up and running, but uptime alone isn’t worth much if your customers are unhappy. Collect metrics that will help you ensure your users get the best experience. After all, a satisfied customer is the best business strategy of all.