2025-01-14

Metrics, monitoring, alerts


WRITTEN BY

Rudolph


When we think of these terms, many of us immediately picture certain scenarios: large dashboards displaying CPU usage across multiple servers or the number of website hits per second. We might imagine Slack or email alerts with status updates like 'new container deployed' or 'service unavailable'. Some might picture huge operations centres where DevOps teams stare at dozens of screens in shifts, hoping that all the indicators stay blissfully green. Either way, most of us associate these terms with a highly technical and complex subject, full of graphs and alerts.

But where do all these graphs and status indicators come from? Who decides what to monitor and why? And what happens when an alert turns red? Most importantly, what exactly do metrics, monitors and alerts mean?


What is it really?

Metrics are the basis for monitoring technical systems. DigitalOcean's introduction to metrics, monitoring and alerting offers a good definition. Simply put, metrics are any measurable data points collected in technical systems.

Google's Site Reliability Engineering book provides an apt description of alerts.

The key point is: An alert is a targeted notification intended to be noticed by a human. It informs of a malfunction or deviation from normal behaviour and demands a response.

Monitors link metrics to alerts. They evaluate metrics, indicate whether a system is functioning properly, and raise an alert when it is not. The book Effective Monitoring and Alerting sums this up well.
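The relationship between the three terms can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the API of any monitoring product: the class names `Metric`, `Monitor` and `Alert`, their fields, and the simple "value above threshold" rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    """A single measurable data point collected from a system."""
    name: str
    value: float


@dataclass(frozen=True)
class Alert:
    """A notification meant for a human, demanding a response."""
    message: str


@dataclass(frozen=True)
class Monitor:
    """Evaluates one metric against a threshold and derives an alert."""
    metric_name: str
    threshold: float

    def evaluate(self, metric: Metric) -> Optional[Alert]:
        # Ignore metrics this monitor is not responsible for.
        if metric.name != self.metric_name:
            return None
        # Healthy: the value is within the accepted range.
        if metric.value <= self.threshold:
            return None
        # Unhealthy: turn the deviation into a message for a human.
        return Alert(
            f"{metric.name} is {metric.value}, above threshold {self.threshold}"
        )
```

A monitor created as `Monitor("error_rate", 0.01)` stays silent for `Metric("error_rate", 0.005)` and returns an `Alert` for `Metric("error_rate", 0.05)`.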



Up to this point, monitoring sounds like a purely technical topic - confirming the initial associations. But is that all there is to it? And why bother? What's the added value?


Can we skip it?

Alerts and monitors highlight anomalies and errors. The goal is to resolve problems before users notice or are affected. Poor user experience can quickly lead to reputational damage, loss of customers and, ultimately, lost revenue. Monitoring should therefore protect the business case, and with it the product. This is the only way to justify the additional effort and cost.

The popular literature often addresses the implementation of alerts from a technical perspective: monitoring server CPU and RAM usage, or the network availability of systems. While these metrics are technically straightforward and accessible, they lack a direct connection to the product. What we want to monitor is the correct functioning of our product. This raises the question: are there better alerts?


Can alerts be derived from business logic?

Let's take the user perspective. If I'm a customer of an online shop and I can't add an item to my shopping cart, I don't care whether there was a bug in the application code, the server's CPU was overloaded or there wasn't enough network bandwidth. Like a product vision, a glitch is best understood from the user's perspective. We aim to detect such failures with appropriate alerts.

Take the example of a broken shopping cart. There are many possible technical causes:

  • The "add to cart" button triggers the wrong API call.
  • The item price is missing from the database.
  • The backend schema definitions don't match the database schema.
  • The backend server is overloaded.

Depending on the specific implementation, this list could go on indefinitely. A business-driven alert such as "user can't add item to cart" usually doesn't depend on a single basic technical metric. It typically requires monitoring multiple metrics and their combinations. Modern monitoring platforms like Datadog simplify this. They allow aggregated alerts defined by multiple thresholds and enable application-level monitoring of APIs.

The choice of system and software architecture also supports the implementation of business-driven alerts. For example, if the API itself is designed around business logic, it is much easier to derive alerts at that level. In the shop example, a single 'add to cart' API call would be ideal. This call can be monitored specifically for errors, with the corresponding error messages providing clear clues for root cause analysis.

This leaves just one question: if an alert is triggered, what happens next?
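Before turning to that question, the idea of a single, business-level 'add to cart' call that is monitored for errors can be made concrete. The following is a minimal sketch under stated assumptions: the `monitored` decorator, the in-memory `outcomes` counter and the `add_to_cart` function are hypothetical and stand in for whatever instrumentation a real monitoring platform provides.

```python
import functools
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical in-memory store. A real setup would ship these counts
# to a monitoring platform instead of keeping them in the process.
outcomes: Counter = Counter()


def monitored(operation: str) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Count successes and failures (by error type) of a business operation."""

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                # The error type is a first clue for root cause analysis.
                outcomes[(operation, "error", type(exc).__name__)] += 1
                raise
            outcomes[(operation, "ok", "")] += 1
            return result

        return wrapper

    return decorator


@monitored("add_to_cart")
def add_to_cart(cart: dict, item_id: str, prices: dict) -> dict:
    """Add one unit of an item to the cart; fail if the item has no price."""
    if item_id not in prices:
        raise KeyError(f"no price for item {item_id}")
    cart[item_id] = cart.get(item_id, 0) + 1
    return cart
```

Because every technical cause surfaces as a failure of the same business operation, one counter per operation is enough to drive an alert, while the recorded error type narrows down the cause.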


No response, no problem?

As mentioned above, every alert requires a response. The problem signalled by the alert needs to be identified and resolved. The specific steps and actors involved depend heavily on the software and its architecture. However, one principle always applies: without a response, an alert and its implementation are pointless. Therefore, every alert should be accompanied by a 'runbook' - a clear set of instructions for quickly identifying, analysing and resolving the cause of the alert. A business-defined alert is often more helpful than a purely technical one.

An alert such as "More than 1% of calls to the 'add to cart' API have failed in the last 5 minutes" is better than "CPU usage on the backend server has exceeded 70%":

  • it clearly indicates that a core functionality of the shop is not working, thereby affecting the user experience.
  • it provides a clear direction for runbook development: a timeframe for the problem, the name of the affected API, and an error message with more details about the cause. This facilitates log analysis and troubleshooting, improving resolution speed and responsiveness.
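The business-driven alert above can be sketched as a sliding-window check. This is a minimal illustration, not how any particular monitoring platform implements it: the class name `ErrorRateMonitor`, its parameters and the in-memory storage are assumptions made for this example, and calls are assumed to be recorded in chronological order.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateMonitor:
    """Alerts when the share of failed calls in a time window exceeds a limit."""

    def __init__(self, window_seconds: float = 300.0,
                 max_error_rate: float = 0.01) -> None:
        self.window_seconds = window_seconds  # 5 minutes by default
        self.max_error_rate = max_error_rate  # 1% by default
        self._calls: Deque[Tuple[float, bool]] = deque()  # (timestamp, failed)

    def record(self, timestamp: float, failed: bool) -> None:
        """Store the outcome of one 'add to cart' call."""
        self._calls.append((timestamp, failed))

    def is_alerting(self, now: float) -> bool:
        """Return True if more than max_error_rate of recent calls failed."""
        # Drop calls that are older than the window.
        while self._calls and self._calls[0][0] < now - self.window_seconds:
            self._calls.popleft()
        # No traffic in the window: nothing to alert on in this sketch.
        if not self._calls:
            return False
        failures = sum(1 for _, failed in self._calls if failed)
        return failures / len(self._calls) > self.max_error_rate
```

Note the design choice for an empty window: this sketch stays silent when there is no traffic at all, although a complete absence of 'add to cart' calls may well deserve its own alert.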

Conclusion

At Spaceteams, we firmly believe that the technical details of software systems and their implementation should always be derived from business requirements and their value to users. This applies to both the product and its monitoring. Monitoring can actively contribute to system reliability by detecting and reporting failures from a business or user perspective. Clear guidelines for runbooks - responses to alerts - further improve the operational experience, leading to faster response and resolution times, reduced downtime and a more stable user experience.

A robust monitoring system also increases developer confidence in the applications they manage. This can reduce time to production and increase feature delivery rates - ideally increasing the business value of a product.

The additional effort required to develop a business-oriented monitoring system can therefore be economically justified, provided the following conditions are met:

  • Each alert has a runbook.
  • The number of alerts is kept as small as possible without leaving blind spots.
  • The system is fundamentally stable.

If the last condition is not met, alerts fire so often that they end up being ignored rather than addressed. The alert state becomes the new normal. If the system isn't stable, development should focus on the product before allocating resources to monitoring.
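The first condition, that each alert has a runbook, lends itself to an automated check, for example as part of a build pipeline. The sketch below assumes alerts are described in code; the `AlertDefinition` class, its `runbook_url` field and the example URL are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlertDefinition:
    """Describes one alert and, ideally, where its runbook lives."""
    name: str
    runbook_url: Optional[str] = None  # hypothetical link to the runbook


def alerts_without_runbook(alerts: List[AlertDefinition]) -> List[str]:
    """Return the names of all alerts that have no runbook attached."""
    return [alert.name for alert in alerts if not alert.runbook_url]


alerts = [
    AlertDefinition("add_to_cart_error_rate",
                    "https://example.com/runbooks/add-to-cart"),
    AlertDefinition("backend_cpu_usage"),
]
missing = alerts_without_runbook(alerts)  # names of alerts lacking a runbook
```

A check like this can fail the build as long as `missing` is not empty, so that no alert goes live without instructions for responding to it.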

While technical details play a crucial role in building a monitoring system, and the vast literature on this broad and complex topic is well justified, we view technical implementation as a secondary step. The most technically advanced system, like the most beautiful codebase, is meaningless if no one wants to buy or use the product it monitors.


Sources

https://www.datadoghq.com/blog/monitoring-101-alerting/

https://www.digitalocean.com/community/tutorials/an-introduction-to-metrics-monitoring-and-alerting

https://sre.google/sre-book/monitoring-distributed-systems/

https://www.oreilly.com/library/view/effective-monitoring-and/9781449333515/ch01.html