Reliable Systems via Good Metrics
When building and running reliable systems, you need data to answer three critical questions:
- Is the system working correctly now?
- How well has it been working over time?
- When did things start to go wrong?
Metrics help us answer these questions. Here we focus on Prometheus metrics and some best practices for them.
Prometheus is a pull-based system for specifying, collecting, and saving metrics. It’s easy to set up and very powerful. There are, however, some best practices around metrics and metric names that are easy to get wrong. Worse, changing metric names is surprisingly difficult, because metrics get used in alerts and dashboards, so a rename can easily break “downstream” systems. It is worth trying to get your metrics right the first time.
A subtle question that people only start asking after using a metrics system for a while is “What makes a good metric?” Metrics have names, values, and types, and there are definitely better and worse ways to name, define, and use metrics in your code. As a quick example, if a metric tracks a duration, its name should end with `_seconds`, not `_ms`, `_s`, or `_time`. More on this later, but these kinds of rules are important and easy to get wrong.
We can describe some best practices around metrics (and we will!), but manually reviewing code against them is error-prone. When possible, use a linter to check your metrics. This post describes how to use the `promlint` tool to check your metrics, and why you might really want to use it. We end with some metrics best practices, including some that unfortunately `promlint` can’t check but are good to keep in mind, and then exhort you to run `promlint` as part of your unit tests.
promlint for linting metrics
Use `promlint` to check your metrics. It’s great. When it complains, it is right and you should fix your metrics. The Prometheus developers built a tool with a very low false-positive rate, so you can trust it.
How to use promlint
The Go library `promlint` has two entry points of note: the constructor `New()` and the method `Lint()`. Because you want to check all the metrics for your application, you should lint your metrics in a unit test inside the `main` package, so that all the libraries `main()` uses are included. Here’s how I use it to check my metrics.
Now you can test your own metrics, and if you violate the automatically-checkable best practices you will get an error. So let’s look at some of the best practices, both automatically-checkable and not.
Good metric names
A good metric name leaves no doubt about what the metric measures and where it comes from. From this definition of “good” we immediately get some rules:
- A metric should be clear about where in the code it comes from.
- A metric name should specify the type of metric.
- A metric should be clear about what it measures.
- A metric name should include the unit of measurement.
Pleasantly, the Prometheus project agrees with these rules about naming metrics. This means that `promlint` can check your metrics for compliance with those rules! As a quick précis, here are the rules:
- Counter metric names (and only counter names) should end with `_total`
- All metric names should end with a unit of measurement (just before the `_total` for counters)
- Units of measurement should be base SI units without a multiplier: use `_seconds` for seconds and `_bytes` for bytes, and avoid `_milliseconds`, `_ms`, `_us`, `_ns`, `_µs`, `_mB`, `_megabytes`, and `_kb`

The linked page has more, but these are the important ones. `promlint` will check your metrics for compliance with these rules as much as possible.
Good metric types
There are four main types of metrics: counters, gauges, summaries, and histograms. There are a few more options, but they are more obscure and easy to get wrong. You should stick to gauges, counters, and histograms, and really you should try to stick to just counters and histograms.
- Counters for metrics that are monotonically increasing.
- Histograms for metrics that need to track a distribution of values.
- Summaries are like histograms, but they are more complex and have more pitfalls. Prefer the simplicity of histograms.
- Gauges for metrics that can go up and down. Use sparingly because they have many subtle pitfalls.
Counters
A critical thing to understand about metrics is that they are collected when the collector wants. Your code has no influence on how often the Prometheus collector comes around to collect metrics. This means you need to design your metrics to be informative even when sampled at very different rates. The most common way to do this is to basically always use counters. Increments of a counter are never missed, the collection rate just determines how coarse the sampling rate is. With counters you will never miss a measurement.
Histograms
Histograms are actually counters under the hood, so everything good about counters is also good about histograms. Histograms never miss a measurement, and the collection rate just determines how far the metrics user can zoom into the data. If they sample every minute, they can see changes minute by minute; if they sample every hour, they can’t zoom in as much. Good histogram names are clear about what they measure and include the unit of measurement, for example `http_request_duration_seconds`.
If you’re measuring HTTP request durations with `http_request_duration_seconds`, the histogram will create multiple counters like:

- `http_request_duration_seconds_bucket{le="0.1"}` (total requests under 0.1s)
- `http_request_duration_seconds_bucket{le="0.5"}` (total requests under 0.5s)
- `http_request_duration_seconds_sum` (total duration of all requests)
- `http_request_duration_seconds_count` (total number of requests)
Now you can see how histograms are a cousin of counters, because they literally are a collection of counters, and everything good about counters is also good about histograms.
Summaries
Summary metrics are a cousin of histograms, but they are more clever internally. They attempt to adaptively track the median value of a metric (or the 95%ile, or some other percentile), but the algorithm for reporting a median is either memory intensive (if they save all measurements and report the exact median at scrape time) or imprecise (if they use a streaming algorithm). The mathematical subtleties are beyond the scope of this post, but the bottom line is that you should strongly prefer histograms. The simplicity of histograms is worth a lot.
Gauges
Gauges are the most problematic type of metric, because there are no rules and therefore no guarantees about gauge values. Gauges have no history, only a current value. A gauge that spikes for 1 second every minute, 15 seconds after the minute, will look like it is either constantly zero or constantly spiking when it is collected every minute. Worse, whether it looks like zero or a spike depends on when the collector started! Gauges are best used for information that changes very slowly. I use gauges to track the git commit that the server binary was built from, because that value changes very slowly (it only changes when I build and deploy a new binary).
Also, be careful to avoid metrics that are ratios. Export numerators and denominators separately. Don’t export a gauge of `http_requests_per_minute`; export `http_requests_total` and let the user do the division. This gives the user more flexibility (turning requests per minute into requests per hour or per second seems easy but is not) and avoids issues when the denominator is zero. In general, if the word “per” is in the metric name, you are probably doing something wrong, especially if the denominator is a time unit. Export a counter or histogram for the numerator, and the user can do the rest.
If you want to keep track of a current allocation, it may be tempting to use a gauge. But this runs into the sampling problem described earlier. Better to use two counters, one for allocations and one for deallocations. The difference between the two counters is the current allocation. This way you don’t have the sampling problem, and you won’t miss massive temporary spikes in allocations. Count packets and bytes sent and received instead of reporting bps. Instead of measuring how much water is in a bucket, measure how much water has been added to the bucket and how much has been taken out.
For example, instead of a gauge like `memory_usage_bytes`, consider using two counters:

- `memory_allocations_bytes_total` (total memory allocated)
- `memory_deallocations_bytes_total` (total memory freed)
This gives you both the current usage (by subtracting) and the historical allocation/deallocation volumes between samples, which a gauge would miss.
By keeping your metrics simple and sample-rate agnostic, you and your metric’s users can change the granularity or adapt them to new situations. This is a huge win. I’ve lost weeks of my professional life to bad metrics in the past, and I’d like to help you avoid that.
Summary
Metrics are important and good!
Tools:
- Use `promlint` to catch common metric mistakes
- Run metric checks as part of your unit tests
Naming:
- Your metric name should be clear about what it measures and where it comes from
- Put units in your metric names
- Use SI units without multipliers
- Ensure counter names end in `_total`
Types and Usage:
- Prefer counters and histograms over gauges
- Prefer histograms over summaries
- Design your metrics so that you never want to have “per” in a metric name
- Track changes (like total water added and removed) rather than states (current water level)
Use it!
The `promlint` tool is great, and you should use it! It will catch some mistakes, and thoughtfully addressing its output will make you think about your metrics in a good way. It is also easy to set up, and you can run it as part of your unit tests.
Along the way, please think deeply about the metric names and types you define and use, and prefer counters and histograms. Metrics are a very “sticky” part of a project, so getting them right early is a great way to avoid much bigger headaches later.
Contribute to the discussion by replying on BlueSky!