Throughout my time working with complex distributed systems, I have learned that consistently detecting systemic issues is surprisingly difficult and a bit of an art. The canonical way to do this is to set up alerts that notify an engineer when something is wrong. I have personally written many terrible alerts and regret having subjected myself and colleagues to the pain of such useless alerts.
## Core Tenets
My opinions on alerting revolve around three tenets.
First, complex systems are always in some state of failure. Although we love to do it, asking whether a system is “up” or “down” is much less meaningful than asking pointed questions like “is our API error rate normal?” This is especially true during incidents.
Second, it is not possible to predict all the ways a system can fail. The traditional approach to alerting is to predict root causes and alert on them. My experience is that this is a hopeless endeavour; I am continuously amazed at the novel ways systems fail.
Third, the sole responsibility of an alert is to notify us of some imminent or ongoing issue. Alerts are not debugging tools, and receiving a flurry of alerts after an incident has already been declared is often not helpful.
## Trigger Types
| | Alert | No alert |
|---|---|---|
| Issue | True positive | False negative |
| No issue | False positive | True negative |
Great alerts consistently trigger when there is an issue, and don’t trigger when there is no issue. False positives and negatives are both terrible, but for different reasons. False positives erode trust in the system and fatigue engineers on-call; it is hard to take on-call seriously if you are constantly bombarded with irrelevant alerts. False negatives delay incident response and can have economic and reputational consequences.
## The Naive Approach
The naive approach to alerting is to attempt to predict the ways a system can fail, and then set up alerts on those predicted root causes. Let’s consider a “Container CPU is reaching limit” alert as an example.
On the surface, this alert seems sensible: a container reaching its CPU limit could plausibly cause an incident, so let’s make sure we catch the issue before that happens. In practice, this alert is virtually guaranteed to produce false positives and to be of little use.
For one, this alert treats all containers equally, and will happily trigger for both a log collector and the system’s primary database, despite these being two vastly different error scenarios. The knee-jerk reaction is to update the alert’s selector to include only important containers or exclude less important containers. This may work for a short while, but systems change all the time, and updating an alert’s selector is about the least sexy engineering work imaginable. Before long, the list of which containers are important will change, and the alert’s selector will not.
The second issue is thresholds. This alert must answer two questions:
- How close to the CPU limit does the container need to be?
- For how long must the container be above this threshold?
The first question is no doubt the easier of the two, but setting the right threshold is still tricky because different workloads react differently to CPU throttling; one workload might be perfectly fine running at 93% of its CPU limit while another may react negatively to running at 85% of its limit. Finding a threshold that works across all workloads is not realistic. Duration is also problematic: set it to one minute and the alert will be much too trigger-happy; set it to 30 minutes and the alert will fail to notify when a brief massive traffic spike to your API slows your site to a crawl.
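The threshold-and-duration mechanics can be sketched in a few lines. The function name and the once-per-minute sample cadence are hypothetical, but the example illustrates how the duration choice alone flips the outcome for the same traffic spike:

```python
# Sketch of a naive "CPU near limit" alert check (names are hypothetical).
# Samples are fractions of the CPU limit, taken once per minute.

def naive_cpu_alert(samples, threshold=0.9, duration_minutes=5):
    """Trigger only if every sample in the last `duration_minutes`
    is at or above `threshold`."""
    window = samples[-duration_minutes:]
    return len(window) == duration_minutes and all(s >= threshold for s in window)

# A five-minute spike trips a short duration but not a long one.
spike = [0.5] * 10 + [0.95] * 5
print(naive_cpu_alert(spike, duration_minutes=5))   # True: pages on a blip
print(naive_cpu_alert(spike, duration_minutes=30))  # False: misses it entirely
```

Whichever pair of numbers is chosen, some workload will make it look wrong: the same spike is either page-worthy noise or a silently missed incident.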
The only way I have seen this alert be borderline acceptable is to make it non-paging[^1] and set the duration to several hours. This way, the alert can help catch workloads that are slowly reaching their CPU limit due to organic growth. Even in this case, the better solution is to manage resources automatically, obviating toil work like manually increasing workload CPU limits.
Other alerts I commonly see that suffer from similar issues:
- “Some important workload has too few pods”
- “A container is crashing”
- “Some event that should happen frequently has not occurred”
- “Some containers are failing their health check”
Some of these alerts are better than others, but they all have the same fundamental flaws:
- They rely on intimate details about the system. This renders the alerts outdated as the system changes beneath them.
- At best, they capture only a tiny subset of possible failure modes. These alerts are likely to miss novel failures.
- They correlate poorly with the user experience of the system. Users don’t know or care that a cron job has not run for a while or that a container is unable to pull its image; they care about the consequences of those issues.
## The Postmortem Trap
As part of the postmortem effort, it is tempting to create alerts for important signals that were missed during incident response. For example, if the root cause of an incident was improper balancing of interrupt requests across CPU cores[^2], then let’s set up an alert so we are sure to catch the next occurrence. I consider this a trap. It may be prudent to set up new alerts after an incident, but only if incident response was late because of a missing or delayed alert. If new alerts are set up, then they should monitor user-level journeys that were affected by the incident, not low-level signals used for debugging during response. Those are better suited for a dashboard. Remember: alerts are not debugging tools.
An interesting question I have never seen asked during a postmortem is: “did any superfluous alerts trigger during the incident that we can remove?” Over-alerting is an issue, and it is worth considering whether some alerts overlap.
To improve the alerting setup, we need a solution that is agnostic about implementation details, able to catch unforeseen problems, and more accurate in capturing the system’s user experience.
## A Better Alternative
Rather than alerting on predicted root causes, we can define the system’s central user journeys and then alert when those journeys are degraded. Using Spotify as an example, some of their user journeys might be:
- Users can play songs.
- Users can create a playlist.
- Users can add songs to their playlists.
Formally defining the system’s central user journeys in this way allows us to monitor their success and trigger an alert when they fail; if song plays are served via an HTTP API, then we can alert when that API starts returning erroneous responses.
This solves the aforementioned issues:
- User journeys are agnostic of implementation; playing a song has remained a core Spotify user journey for many years regardless of how much the underlying implementation has changed. User journeys change much less frequently than system architecture[^3].
- Monitoring success closer to the point of user interaction catches unpredictable lower-level failure modes. As an example, I have previously encountered an issue caused by many extremely high network throughput workloads convening on a single node. This failure mode was not on the team’s radar at all, but the high-level user journey alert caught it.
- User journeys are by definition at the heart of the user experience. As a user of Spotify, I care about whether I can play a song, not what the CPU pressure of the song catalogue service is.
## Service Level Indicators
Once the system’s user journeys are understood, then we can set up service level indicators to gauge the current level of service provided. In our Spotify example, a service level indicator for song plays might be the proportion of successful responses to total responses[^4] in the HTTP API that serves plays. If 98 out of 100 song play requests succeed, then the service level is 98%[^5].
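A minimal sketch of this request-based indicator, assuming events have already been counted as good or bad (the function name is mine, not a standard API):

```python
# Minimal sketch of a request-based service level indicator.

def sli(good_events: int, total_events: int) -> float:
    """Proportion of good events, as a percentage."""
    if total_events == 0:
        return 100.0  # no traffic: conventionally treated as fully healthy
    return 100.0 * good_events / total_events

print(sli(98, 100))  # 98.0
```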
It is paramount that the service level is measured as close to the point of user interaction as possible. Measuring responses directly from the API pods will miss a failure in the ingress pods. Measuring responses from the ingress pods will miss a failure in the frontend. The closer we can get to the actual user experience, the better. Even then, while service level indicators help tremendously, they are not a panacea; a frontend bug that turns the play button invisible will likely not be caught by a service level indicator because requests to the play API stop altogether[^6].
Service level indicators help us understand what the current level of service is, but they don’t tell us whether it is satisfactory. For that, we need service level objectives.
## Service Level Objectives
At some point, thresholds have to be defined for when to page an engineer for an ongoing issue. In complex systems, defining what is a serious issue or incident is tricky. Continuing our Spotify example, consider these failure scenarios:
- 0.6% of song plays are failing in Latvia.
- 3% of users on Windows are unable to create new playlists.
- Attempting to add a song whose title starts with `@` to a playlist fails 100% of the time.
Do any of these failures warrant paging an engineer?
Complex systems are always in some state of failure, so it is not realistic to use “anything greater than zero” as a paging threshold. Instead, we can define service level objectives for our user journeys. In Spotify’s case, these could be:
- 99.9% of song plays succeed.
- 99% of playlist creations succeed.
- Adding a song to a playlist succeeds 99.5% of the time.
The combination of service level indicators and objectives grants us insight into the current service level, and whether it is at, below, or above the expected performance. Additionally, we get an error budget and the useful burn rate abstraction.
Error budget is 100% minus the service level objective; if the objective is 99.5%, then the error budget is 0.5%.
Burn rate is how fast the error budget is being spent. An API that handles 100 requests per second with an objective of 98% can return two errors per second and still stay within budget. Any more than two errors per second, and the API is burning its error budget. The rate of burn tells us how fast the entire error budget will be exhausted. For a given period of time, a burn rate of one means that the error budget will be exhausted exactly when the time period ends. A burn rate of two means that the error budget will be exhausted halfway through the time period. The higher the burn rate, the more severe the issue is.
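The arithmetic above can be made concrete with a small sketch (the function names are mine, not a standard API):

```python
# Worked example of error budget and burn rate, using the numbers above.

def error_budget(slo_percent: float) -> float:
    """Fraction of requests allowed to fail."""
    return (100.0 - slo_percent) / 100.0

def burn_rate(observed_error_ratio: float, slo_percent: float) -> float:
    """How fast the budget is spent: 1.0 = exactly on budget."""
    return observed_error_ratio / error_budget(slo_percent)

# 100 req/s with a 98% objective allows 2 errors per second.
print(error_budget(98.0) * 100)  # 2.0 errors/s at 100 req/s

# 4 errors/s spends the budget twice as fast as it accrues,
# exhausting it halfway through the period.
print(burn_rate(4 / 100, 98.0))  # 2.0
```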
Burn rate is an excellent metric to alert on. Assuming sensible indicators and objectives have been set up, burn rate will capture degradations of the system’s user experiences regardless of what the root cause is. Burn rate alerts are also low-maintenance; the same query can be used across different service level objectives, and burn rate will automatically adjust if the objective is loosened or tightened.
Burn rate alert expressions can get unwieldy, and I recommend using Google’s multi-window multi-burn-rate alert expressions. These will trigger a non-paging alert for slow and steady burn and a paging alert for quick burn that risks imminently exhausting the error budget. Multi-window multi-burn-rate alerts are the gold standard.
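A rough sketch of such a policy, using the window and threshold pairs from Google’s published example for a 30-day SLO window (the exact numbers are conventions, not requirements, and real implementations express this as monitoring queries rather than application code):

```python
# Hedged sketch of a multi-window multi-burn-rate policy.
# Each condition pairs a long window (severity) with a short
# window (recency) so alerts both trigger and resolve promptly.

def classify(burn):
    """`burn` maps window name -> observed burn rate."""
    if burn["1h"] >= 14.4 and burn["5m"] >= 14.4:
        return "page"    # fast burn: ~2% of the monthly budget in an hour
    if burn["6h"] >= 6.0 and burn["30m"] >= 6.0:
        return "page"    # still fast: ~5% of the monthly budget in six hours
    if burn["3d"] >= 1.0 and burn["6h"] >= 1.0:
        return "ticket"  # slow, steady burn: follow up during work hours
    return "ok"

print(classify({"5m": 20, "1h": 16, "30m": 2, "6h": 2, "3d": 0.5}))  # page
```

The short window in each pair keeps the alert from staying red long after the issue has recovered; the long window keeps a momentary blip from paging anyone.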
## Impending Limit Alerts
Burn rate alerts are good at detecting issues that have happened, but they do little to warn engineering teams of issues that will occur unless some action is taken. Common examples of this are disks slowly reaching capacity or a database reaching its connection limit. I find that this category of issues is best handled with traditional alerts that monitor how close resources are to hitting their limit. This way, a non-paging alert can fire, letting an engineer know to follow up during work hours. As previously mentioned, setting up automation to handle these limits can save both alerts and toil work, but it is often not feasible to automate management of all resource limits.
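One common shape for such a check is a linear projection of recent growth; this is a hedged sketch (the sample cadence and what counts as "soon enough to act" are choices left to the team):

```python
# Sketch of an impending-limit check via linear extrapolation.

def hours_until_full(samples, capacity, interval_hours=1.0):
    """Given evenly spaced usage samples (oldest first), project when
    usage reaches capacity. Returns None if usage is flat or shrinking."""
    growth_per_hour = (samples[-1] - samples[0]) / ((len(samples) - 1) * interval_hours)
    if growth_per_hour <= 0:
        return None  # nothing to warn about
    return (capacity - samples[-1]) / growth_per_hour

# Disk at 80 GiB of 100 GiB, growing 2 GiB/hour: ~10 hours left.
print(hours_until_full([74, 76, 78, 80], capacity=100))  # 10.0
```

A non-paging alert fed by a projection like this gives an engineer a comfortable window to react during work hours instead of a page when the disk is already full.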
## Conclusion
Catching defects in a complex system is difficult, and the consequences of getting it wrong can be a bad night’s sleep or an incident with no responders. Service level indicators and objectives encourage engineers to think about reliability and alerting from the user’s perspective. Alerts based on burn rate provide the ideal combination of high confidence and low maintenance, but are not a silver bullet and can be complemented with traditional lower-level alerts that monitor limits for select resources.
Many of the thoughts presented in this post are not originally my own, and I would be remiss not to credit those who have influenced me on this topic. In particular, Rob Ewaschuk’s My Philosophy On Alerting and Google’s Site Reliability Engineering book stand out.
[^1]: Not able to wake someone up.

[^2]: This was an actual incident that left my team stumped for a while. What really threw us off was that the bug caused average node CPU to go down while performance tanked because a single CPU core tried to handle network interrupt work previously done by all the cores on the node.

[^3]: It is not possible for alerts to be completely agnostic of implementation details. At some point, an alert has to measure a metric emitted by a service. If that metric changes or the service is decommissioned, then the alert will have to change.

[^4]: In the context of HTTP APIs, 5XX responses are commonly considered errors and all other status codes are considered successes. 429 is also sometimes considered an error.

[^5]: It’s important to mention that service level indicators need not be based on HTTP requests. They work well with any event-based activity that can be grouped by good and bad events.

[^6]: This category of failure, where activity suddenly stops instead of failing, can be caught with alerts that catch sudden large drops in traffic. My experience is that these alerts are decent in practice.