Design of the checkin service
“Making a service that is radically simple and reliable”
The critical insight is that a service collecting (timestamp, uuid) pairs, where only the most recent timestamp per UUID matters, is actually maintaining a CRDT (a conflict-free replicated data type).
All the work that databases do to prevent conflicting updates just goes out the window, and we can build a database that runs in parallel with multiple masters without having to worry about the CAP theorem.
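A minimal sketch of that idea (names are illustrative, not from our codebase): merging two views of the same UUID just means keeping the later timestamp, and that operation is commutative, associative, and idempotent, which is all a CRDT needs.

```go
package lww

import "time"

// Checkin is a last-write-wins register keyed by UUID: the only state we
// keep is the most recent server-side timestamp we have seen.
type Checkin struct {
	UUID string
	Seen time.Time
}

// Merge combines two views of the same UUID. Because "keep the later
// timestamp" is commutative, associative, and idempotent, replicas can
// exchange state in any order, any number of times, and still converge.
func Merge(a, b Checkin) Checkin {
	if b.Seen.After(a.Seen) {
		return b
	}
	return a
}
```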
Everything else is a consequence of that one insight, combined with a desire for simplicity across the board. The rest of this document is about how we built the backend system, and the principles we stuck to in the process.
Architecture
We use a microservice architecture. Each service is a separate binary run in its own Docker container. There are three main services: the user service, the email service, and the checkin service. We implement and run these services to support the phone apps (which run on iOS and Android on users’ phones) and user check-ins (which come from inside users’ infrastructure).
The user service is necessarily more centralized because it needs to perform actions exactly once. This means that the user service is inherently less reliable, so we design the phone apps to not need to contact it more than once a week: they should try every day (or maybe even more often?), but things should only start going sideways if they have been out of contact for a week.
The checkin service is distributed. Our fleet of five or more checkin servers operates in master-master mode, and each can be authoritative. The checkin service must be reliable, and we achieve that reliability through redundancy and by running in multiple clouds. Users should always be able to contact a checkin server, either to perform a checkin or to retrieve the time of the last checkin.
The email service is distributed the same way email is distributed: with MX DNS records. Each box in our fleet runs an email server, and the SMTP protocol supports failover from one MX host to the next. Receiving a check-in email causes the email service to invoke the checkin service for the appropriate UUID.
Users use {protocol}://{server}/api/v1/checkin/{uuid} to check in and {protocol}://{server}/api/v1/getlastcheckin/{uuid} to get the last check-in for a given UUID. {protocol} can be http or https. {server} can be any of:
- triplepat.com triplepat.net
- a.triplepat.com a.triplepat.net
- b.triplepat.com b.triplepat.net
- c.triplepat.com c.triplepat.net
- d.triplepat.com d.triplepat.net
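For illustration, here is the shape of a check-in from a Go client against the servers listed above. The UUID is made up, GET is an assumption (the HTTP method is not specified here), and the body of the getlastcheckin response is printed as-is rather than parsed.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	uuid := "123e4567-e89b-12d3-a456-426614174000" // example UUID only

	// Check in. Any server in the list works; they are master-master.
	resp, err := http.Get("https://a.triplepat.com/api/v1/checkin/" + uuid)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	fmt.Println("checkin status:", resp.Status)

	// Ask a different server for the last check-in; any replica can answer.
	resp, err = http.Get("https://b.triplepat.net/api/v1/getlastcheckin/" + uuid)
	if err != nil {
		log.Fatal(err)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	fmt.Println("last checkin:", string(body))
}
```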
The email service is also distributed, because the way SMTP and DNS MX records work means that we can specify a number of email servers and the senders’ MTAs will automatically fail over. We run one email-service container on every mirror. This allows users to check in by emailing {uuid}@checkin.triplepat.com or {uuid}@checkin.triplepat.net.
All our APIs, UUIDs, URLs, and email addresses are case-insensitive.
As much as possible, we store all “brains” and user preferences in the phone
apps. That way, the user doesn’t have to contact us very often except for
checkins. The checkin server has no knowledge of users or userids. It just
stores (UUID, timestamp)
pairs. We never trust user clocks, so when users
perform a checkin, we store a timestamp generated by the server rather than one
from the user’s phone. Our servers’ clocks are synced by our cloud providers.
We trust them more than we trust user clocks.
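A sketch of what that means on the server side (names are illustrative, not the actual handler): the timestamp is generated on the server, and nothing from the client other than the UUID is consulted.

```go
package api

import (
	"net/http"
	"strings"
	"time"
)

// Store is whatever records (UUID, timestamp) pairs; in our case it is the
// SQLite-backed store described in the Storage section.
type Store interface {
	Checkin(uuid string, seen time.Time)
}

// CheckinHandler records a check-in for the UUID in the URL path. The
// timestamp comes from the server's clock; the request body and any
// client-supplied time are ignored.
func CheckinHandler(s Store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// UUIDs are case-insensitive, so normalize before storing.
		uuid := strings.ToLower(strings.TrimPrefix(r.URL.Path, "/api/v1/checkin/"))
		s.Checkin(uuid, time.Now().UTC()) // real code gets time from the clocks package (see Time)
		w.WriteHeader(http.StatusOK)
	}
}
```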
Defaults
We use default, boring technology whenever possible. Every time we are clever, we should know the reason why.
This means we use:
- SQLite for the checkin-service database.
- PostgreSQL for the user-service database.
- Tailscale for internal networking.
- Letsencrypt for SSL/TLS certificates.
- cert-manager to manage them on each host.
- nginx for our public web server and reverse proxy.
- Docker for containerization.
- Docker Hub for our container images.
- Prometheus for monitoring and metrics.
- Grafana for monitoring dashboards.
- Github for our source code management.
- Github Actions for CI/CD.
- Go for our server programming language.
- Kotlin and Swift for our mobile programming languages.
- AWS and GCP for our cloud providers, and we only use tiny vanilla machines to avoid lock-in. We also use TILAA to avoid lock-in to “just the big providers”.
- Godaddy for our DNS provider for triplepat.com
- CloudFlare for our DNS for triplepat.net
- Hugo for our static website generator.
We also use every available linter and formatter to make sure our code, configs, and output are internally clean and consistent and align with the expectations of the outside world. Every lint rule we use is one less rule we need to remember, and human brain space is precious.
Reliability
We acquire reliability via the “cockroach” strategy of many cheap replications around the world. We run multiple instances and we run instances in multiple clouds. To do this, we need to make sure the computers we rent in any given cloud are not too expensive. We also need to monitor our systems because “if you aren’t monitoring, you don’t have uptime”.
Efficiency
We keep our memory requirements low, because the cost of running a single server directly translates to how many servers we can realistically run, and more servers means more reliability. LOCKSS (Lots Of Copies Keep Stuff Safe) is the order of the day. Right now our services collectively (nginx + Alloy + checkin-service) run fine on 1GB of RAM, which means we pay under $20/server/month. Going to 2GB doubles the per-server cost, 4GB quadruples it, and so on. For services that don’t need much CPU, RAM is the cost driver, with network usage lurking behind it. Our cockroach strategy means we need to respect RAM usage.
Storage
We use PostgreSQL for the centralized user-service
data. This allows for
better concurrency and scalability compared to file-based databases, which is
important for the user service’s role.
For the distributed checkin-service, we use SQLite. It is simple, small, allows us to store data in a single file per instance, and we don’t have to write our own serialization code. The synchronization between checkin-service instances is handled at the application level, not the database level.
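A sketch of the kind of schema and write path this implies. Table and column names are illustrative, and the import assumes the common mattn/go-sqlite3 driver; the important part is that the MAX() in the upsert is the entire conflict-resolution story.

```go
package storage

import (
	"database/sql"
	"time"

	_ "github.com/mattn/go-sqlite3" // assumed driver; any database/sql SQLite driver works
)

const schema = `
CREATE TABLE IF NOT EXISTS checkins (
  uuid TEXT PRIMARY KEY,
  last_checkin_seconds INTEGER NOT NULL  -- unix seconds, server clock
);`

// RecordCheckin keeps only the most recent timestamp per UUID. Because the
// upsert takes the MAX of the old and new values, replicas can replay each
// other's writes in any order and still converge.
func RecordCheckin(db *sql.DB, uuid string, seen time.Time) error {
	_, err := db.Exec(`
		INSERT INTO checkins (uuid, last_checkin_seconds) VALUES (?, ?)
		ON CONFLICT(uuid) DO UPDATE SET
		  last_checkin_seconds = MAX(last_checkin_seconds, excluded.last_checkin_seconds)`,
		uuid, seen.Unix())
	return err
}
```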
We avoid using database-specific features where practical to retain flexibility. We use PostgreSQL for the user service because we need to be able to write ad-hoc queries over our set of users, so we need the storage to be safely readable and writable by other processes.
Logging
We log using slog. Because of its flexibility, we can swap out backends if we need to use someone else’s log processing software (e.g. Logrus and the like), but outside of main(), things are logged to slog or they aren’t logged. Use the slog global logger, or derive your custom logger from it; every main function should set it up with good settings.
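A minimal sketch of what “set it up with good settings” can look like in a main(); the handler choice and level here are illustrative, not a mandated configuration.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Configure the global logger once, here, and nowhere else.
	logger := slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{
		Level:     slog.LevelInfo,
		AddSource: true, // include file:line, useful when reading logs from many replicas
	}))
	slog.SetDefault(logger)

	// Everything outside main() just uses the global logger (or a child of it).
	slog.Info("checkin-service starting", "listen", ":8080")
}
```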
Monitoring and Metrics
We set things up to be monitored with Prometheus. Its pull-based scraping model makes the most sense for us because it means that Alloy can choose how much and how often to scrape, and we can tune things to make sure we don’t spend more than we intend by sending too much to Grafana. Also, pull-based semantics force us to generate metrics that are sampleable, like counts, instead of ephemeral metrics like gauge values. It is appropriate to force the developer to choose the right metric type, because they are the one who knows best what is being measured.
Strongly prefer Counters and Histograms for metrics instead of Gauges. Prometheus works by sampling metrics, and sampling a counter or histogram coarsens data but sampling a gauge loses data.
Every metric name should lint correctly with promlint.
Every metric name should also end with the unit of measure or _total. This helps us build dashboards and combine values in a sensible manner. Use base SI units when possible (“meters” instead of “kilometers”), and strongly prefer counts to gauges.
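For example, using the standard Prometheus Go client; the metric names here are illustrative, but they follow the rules above (promlint-clean, _total suffix on counters, base SI units).

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// A counter: the name ends in _total and the value only ever goes up, so a
// scrape at any interval can be turned into a rate without losing data.
var CheckinsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "checkin_requests_total",
	Help: "Number of check-in requests received.",
})

// A histogram: the name carries the base SI unit (seconds, not milliseconds).
var CheckinDurationSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "checkin_request_duration_seconds",
	Help:    "Time spent handling check-in requests.",
	Buckets: prometheus.DefBuckets,
})
```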
Docker
Every binary we run in production is run in a Docker container. This means each of them needs a Dockerfile in the same directory as its main() function. Please use checkin-service/Dockerfile as an example to get you started. If you do it right, then you can keep compilation times low while still getting the benefits of in-container unit testing.
Every Dockerfile should do one thing. Containers are not VMs and should not be used like VMs. We enforce this by using distroless images where we can and only falling back to Alpine-linux images when we need ad-hoc debug tooling.
Run one process per container, and try to invoke that process without forcing the shell to parse a string to run it (use the array form of ENTRYPOINT). The process should run as nobody, or we should have a good reason why it can’t.
Use ENTRYPOINT instead of CMD in the Dockerfiles, because our Docker images are meant to hold and run a single binary.
Build and deploy in different containers using multi-stage builds. Among other things, this keeps the deployed size of our images quite small (under 20MB). Deploying a compiler alongside every service is not only wasteful of disk space and bandwidth, it also needlessly adds complicating pieces to our images. We keep things simple by removing needless complications.
Every Dockerfile should have a HEALTHCHECK directive in it. This allows deployment to stall until a container is actually working instead of treating container invocation as a “fire and forget” operation. There is a healthcheck directory containing helpful libraries and a healthcheck/check-health binary to perform HTTP-based health checks (like the ones provided by the healthcheck library). Also, the person building the service knows best what the healthcheck should be, so it is appropriate to make the Dockerfile author design it, and not the deployer.
Each Dockerfile should, as much as possible, start with exactly the same text as the others. In particular, everything up through the running of all the unit tests should be the same. This allows us to only run the unit tests once when we build all our containers, because docker build can use the cache hit from testing the first to skip the tests for the second.
Health checks
Every daemon we run in a container should support health checks. The simplest implementation of health checks can be found in our internal healthcheck library. It contains a server that reports the daemon’s health, a binary that connects to that server and reports the health back to Docker via its exit code, and a simple interface for things that want to be checked.
All of our docker images should have health checks, one way or another.
It is easy to add a healthcheck to your service by implementing the healthcheck.CheckableService interface, which has one method. You should have a health check for every service, and pass all such services to the health check server when you start it in main().
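For illustration only: the real interface lives in the healthcheck package, and the method name and signature below are assumptions standing in for whatever that single method actually is.

```go
package health

import "errors"

// CheckableService is a stand-in for healthcheck.CheckableService; the real
// interface has one method, whose exact name and signature live in the
// healthcheck package. The shape below is an assumption.
type CheckableService interface {
	Healthy() error // nil means healthy
}

// CheckinService's health check verifies the one thing it cannot work
// without: an open database handle.
type CheckinService struct {
	dbOpen bool
}

func (s *CheckinService) Healthy() error {
	if !s.dbOpen {
		return errors.New("database is not open")
	}
	return nil
}
```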
We should open-source this small library. It is so useful and good.
Test coverage
All code should be 100% covered by unit tests. Getting uncovered code back to being covered should be considered a high-priority task. 100% coverage doesn’t guarantee no bugs, but it does guarantee that every line of code can be executed without crashing in at least one context. Also, the difference between 100% and 99% is psychologically much greater than the difference between 99% and 98% or 79% and 78% — once you start allowing coverage to slip, it’s hard to get it back up and easy to let it get worse. Beyond simply executing every line, striving for complete coverage encourages writing code that is inherently more modular and designed to be testable. “Keep it 100” and cover everything to make sure things stay working and the design stays good.
Linting
All code is linted with golangci-lint. All code has to lint clean, and this is enforced by our Github Actions. You can put exceptions in with comments, but the exceptions should be few and well-motivated. If we find that a particular linter run by golangci-lint is requiring lots of exceptions, we should decide to either turn that linter off or comply with it everywhere.
All prometheus metrics and Dockerfiles are linted too!
Because we don’t have a big team and are often working under time-pressure and “just trying to get it to work”, we want to use every linter possible. Everything that can help us offload “smell tests” to automated machines is a good idea.
Error handling and recovery
Feel free to use rtx.Must and rtx.ValueOrDie liberally for errors that are genuinely unrecoverable or indicate a programming bug (e.g., parsing a known-good embedded asset). Crashing immediately in these cases makes the problem obvious and prevents potentially corrupted state. These helpers also simplify testing, as you don’t need to write tests for error-handling paths that simply crash.
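For example, the following sketch uses a local must helper because the exact rtx signatures are not shown in this document; real code would call rtx.Must or rtx.ValueOrDie directly, and the embedded asset is simulated with an inline byte slice.

```go
package config

import (
	"encoding/json"
	"fmt"
)

// must mirrors the role rtx.Must plays for us: crash immediately, with
// context, on errors that can only mean a programming or build mistake.
// (The real rtx signatures may differ; this shows the shape of the idea.)
func must(err error, msg string) {
	if err != nil {
		panic(fmt.Sprintf("%s: %v", msg, err))
	}
}

// defaultsJSON stands in for a known-good embedded asset.
var defaultsJSON = []byte(`{"checkin_interval": "24h"}`)

type Defaults struct {
	CheckinInterval string `json:"checkin_interval"`
}

// LoadDefaults has no error return: if the embedded asset does not parse,
// the binary was built wrong, and crashing is the obvious, testable behavior.
func LoadDefaults() Defaults {
	var d Defaults
	must(json.Unmarshal(defaultsJSON, &d), "embedded defaults are corrupt")
	return d
}
```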
All errors must be explicitly handled. Common handling strategies include:
- Crashing: Using rtx.Must or rtx.ValueOrDie for unrecoverable errors, as described above.
- Returning the error: Passing the error up the call stack for a higher-level function to handle. This is the standard Go approach and is perfectly acceptable.
- Handling the error: Implementing specific logic to recover from the error or take alternative action within the current function.
- Logging the error: If an error cannot be handled appropriately by the current function (and returning it is not suitable, perhaps because it’s in a background goroutine or at the top level of a request handler), use slog.Error(...) to record it. This increments a Prometheus counter, making unhandled operational errors visible.
Avoid both logging an error and then returning the same error, as this often leads to duplicate log entries. Choose one appropriate handling strategy for each error and stick to it. If you can handle the error, then it likely should cause a Warn or Info log message, because it’s not a true error, just something unexpected. Casually logging at the Error level causes alert spam and operator fatigue.
Command line arguments
We build our Docker images with ENTRYPOINT set, which means we pass in command-line arguments in list form as part of the command: directive in the appropriate YAML file.
Operations
We use GitOps as our operations and deployment strategy. When you push a new tag to the go repo, an image with that tag is built and pushed to Docker Hub. We use Github Actions to build and push the images.
Code repos run all tests and linters on every push, and build and push new
containers on every tag. Config repos push the configs to production on every
push or merge to main
. We acknowledge that code is configs and configs are
code, so the line is fuzzy, but the categorization in practice should be clear
in each case.
Push to prod on merge to main:
- server-configs
- website

Build+Lint on merge, push on tag:
- go
- ios_app
- android_app
We keep our configurations, as much as possible, in the server-configs repo, deployed as the compose.yml file that is run by docker compose. If we move to k8s, then it will be a different file name, but the principle is the same: one file to organize them all.
Deployments of containers always use --wait to ensure that all the containers are healthy before moving on to the next machine. In this way, bad configs that break containers at start only take out one machine.
We rollback just like we roll forward: push and merge to main.
Security
We use letsencrypt for SSL/TLS certificates. We use cert-manager to manage them on each host. Each host manages its own certificates because then we don’t have to worry about running centralized secret provisioning. All our public services are behind nginx, which terminates SSL/TLS and proxies requests to the appropriate internal service.
Our internal-only services are behind Tailscale. We only share internal services between machines over our Tailscale network.
In general, we prefer a “fail2slowdown” strategy instead of an “ostrich” or “fail2ban” strategy. We want to slow down attackers, but we also want to keep our users happy. We want to be able to detect and respond to attacks while continuing to serve normal traffic. Public services should be wrapped in a handler that slows down responses to IPs that have sent us erroneous requests. This prevents us from being overwhelmed by bad requests and prevents Rumpelstiltskin attacks. That said, people developing against our service will often make lots of bad requests before figuring out how to make the good ones they want to use, so we don’t want to ban them.
Each bad request from an IP causes all responses to that IP to be slower by 1 second for the next two hours, up to a max added delay of 10 seconds.
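A sketch of the shape of that wrapper. The numbers come from the policy above, but the bookkeeping is simplified: the real handler would also evict old entries and get its time from the clocks package described below.

```go
package slowdown

import (
	"net"
	"net/http"
	"sync"
	"time"
)

const (
	penaltyPerBadRequest = 1 * time.Second
	maxPenalty           = 10 * time.Second
	penaltyWindow        = 2 * time.Hour
)

type entry struct {
	badRequests int
	lastBad     time.Time
}

// Handler delays responses to IPs that have recently sent bad requests:
// one extra second per bad request in the last two hours, capped at ten.
type Handler struct {
	mu      sync.Mutex
	penalty map[string]entry
	next    http.Handler
}

func New(next http.Handler) *Handler {
	return &Handler{penalty: map[string]entry{}, next: next}
}

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	ip, _, _ := net.SplitHostPort(r.RemoteAddr)

	h.mu.Lock()
	e := h.penalty[ip]
	if time.Since(e.lastBad) > penaltyWindow {
		e = entry{} // penalties expire after the window
	}
	delay := time.Duration(e.badRequests) * penaltyPerBadRequest
	if delay > maxPenalty {
		delay = maxPenalty
	}
	h.mu.Unlock()

	time.Sleep(delay) // slow down, don't ban

	rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
	h.next.ServeHTTP(rec, r)

	if rec.status >= 400 { // a bad request earns the IP another second
		h.mu.Lock()
		e = h.penalty[ip]
		e.badRequests++
		e.lastBad = time.Now()
		h.penalty[ip] = e
		h.mu.Unlock()
	}
}

// statusRecorder remembers the status code so we know whether the request was bad.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}
```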
We also limit the number of simultaneous connections (good or bad) from a single endpoint, both in our nginx config and in our Go code.
We add no-index and no-cache headers to all API responses.
Networking
Internal services are not exposed to the public internet, and are only exposed over Tailscale connections. We use Tailscale for all internal networking between hosts. Within a host, we use docker-compose to run services and it sets up a virtual internal network for them.
We use Godaddy for triplepat.com DNS (both registration and serving). We use Cloudflare for triplepat.net DNS (both registration and serving). We have no CDN and do not want one.
We have matching records for DNS on two different TLDs. We have www., @, MX, a., b., c., and d. records for triplepat.net and triplepat.com. Those TLDs are run by two different providers.
Website
We use Hugo to generate our website. Our website is static. Comments are provided by linking to BlueSky posts.
Time
Programming using time is often terrible for testing and a breeding ground for subtle bugs. As much as possible, we use the clocks package to provide a clock that is easy to mock in tests. We use clocks.System for all time-related operations in production, and we use clocks.Static or clocks.Blocking for all time-related operations in tests. We use the .Sleep(), .Now(), and .After() methods associated with those clocks instead of time.Sleep().
Ideally, time.Now() and time.Sleep() (and the helpers in time that consult the current time, like Until() and Since()) should never appear in the codebase outside of the clocks package and _test.go files.
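For illustration, here is the shape of code written against such a clock. The interface below is a stand-in for whatever the clocks package actually exports; its method set matches the methods named above, and the function itself is a hypothetical helper, not real code.

```go
package scrub

import "time"

// Clock is a stand-in for the interface satisfied by clocks.System,
// clocks.Static, and clocks.Blocking; the exact definition lives in the
// clocks package.
type Clock interface {
	Now() time.Time
	Sleep(d time.Duration)
	After(d time.Duration) <-chan time.Time
}

// StaleUUIDs returns the UUIDs whose last check-in is older than maxAge.
// Taking the clock as a parameter is what makes this trivially testable
// with a static clock instead of real wall time.
func StaleUUIDs(c Clock, lastSeen map[string]time.Time, maxAge time.Duration) []string {
	var stale []string
	cutoff := c.Now().Add(-maxAge)
	for uuid, seen := range lastSeen {
		if seen.Before(cutoff) {
			stale = append(stale, uuid)
		}
	}
	return stale
}
```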
Memoryless
For jobs that need to happen repeatedly, we want to prevent synchronization. The best way of preventing synchronization is to ensure that the events occur in a Poisson process: its exponentially distributed waits are memoryless, so independent instances cannot drift into lock-step. For that, we created memoryless.Ticker{}, which has one method, Tick(), which produces a <-chan time.Time just like time.Tick().
Production code should use memoryless.Ticker.Tick instead of time.Tick or time.NewTicker, or it should have a good explanation as to why not.
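This is not the real memoryless package, just a sketch of the underlying idea: waits drawn from an exponential distribution make the resulting events a Poisson process, so replicas never fall into lock-step. Any bounds the real Ticker may apply are omitted here.

```go
package jitter

import (
	"math/rand"
	"time"
)

// tick sends the current time at exponentially distributed intervals with
// the given mean, which is what makes the resulting events a Poisson process.
func tick(mean time.Duration) <-chan time.Time {
	ch := make(chan time.Time)
	go func() {
		for {
			wait := time.Duration(rand.ExpFloat64() * float64(mean))
			time.Sleep(wait) // the real code would use the clocks package
			ch <- time.Now()
		}
	}()
	return ch
}
```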
Subscriptions
Users log in to our services using “Log in with Apple” or “Log in with Google”. This causes us to create an account for them if they don’t already have one. They then get our free tier of service. In the free tier, you get 3 UUIDs, and your UUIDs get scrubbed and deleted every month. If you want UUIDs to never get deleted, you can subscribe to our paid plan.
If you have a subscription, then you get UUIDs that are not scrubbed and are yours forever.
We have a web-hook in user-service to accept notifications from the Apple App Store about new subscriptions. That web-hook updates the state in our DB, and it is the DB state that we will use for our scrubbing utility. The iPhone app also reports its subscription state and supports sending the purchase receipt to us for verification, so we try to make sure we never do harm to an existing customer.
Summary
Keep it simple, respect CS theory, test everything, lint everything, and use automation everywhere. That’s how we built it.