Design of the checkin service
“Making a service that is radically simple and reliable”
The critical insight is that a service collecting (timestamp, uuid) pairs, where only the most recent timestamp per UUID matters, is actually maintaining a CRDT (a conflict-free replicated data type).
All the work that databases do to prevent conflicting updates just goes out the window, and we can build a database that runs in parallel with multiple masters without having to worry about the CAP theorem.
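A minimal sketch of that idea (names are illustrative, not from our codebase): merging two views of the same UUID just means keeping the later timestamp, and that operation is commutative, associative, and idempotent, which is all a CRDT needs.

```go
package lww

import "time"

// Checkin is a last-write-wins register keyed by UUID: the only state we
// keep is the most recent server-side timestamp we have seen.
type Checkin struct {
	UUID string
	Seen time.Time
}

// Merge combines two views of the same UUID. Because "keep the later
// timestamp" is commutative, associative, and idempotent, replicas can
// exchange state in any order, any number of times, and still converge.
func Merge(a, b Checkin) Checkin {
	if b.Seen.After(a.Seen) {
		return b
	}
	return a
}
```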
Everything else is a consequence of that one insight, combined with a desire for simplicity across the board. The rest of this document is about how we built the backend system, and the principles we stuck to in the process.
Architecture
We use a microservice architecture. Each service is a separate binary run in its own Docker container. There are three main services: the user service, the email service, and the checkin service. We implement and run these services to support the phone apps (which run on iOS and Android on users’ phones) and user check-ins (which come from inside users’ infrastructure).
The user service is necessarily more centralized because it needs to perform actions exactly once. This means that the user service is inherently less reliable, so we design the phone apps to not need to contact it more than once a week: they should try every day (or maybe even more often?), but things should only start going sideways if they have been out of contact for a week.
The checkin service is distributed. Our fleet of five or more checkin servers operates in master-master mode, and each can be authoritative. The checkin service must be reliable, and we achieve that reliability through redundancy and by running in multiple clouds. Users should always be able to contact a checkin server, either to perform a checkin or to retrieve the time of the last checkin.
The email service is distributed the same way email is distributed: with MX DNS records. Each box in our fleet runs an email server, and the SMTP protocol supports failover from one MX host to the next. Receiving a check-in email causes the email service to invoke the checkin service for the appropriate UUID.
Users use {protocol}://{server}/api/v1/checkin/{uuid} to check in and {protocol}://{server}/api/v1/getlastcheckin/{uuid} to get the last check-in for a given UUID. {protocol} can be http or https. {server} can be any of:
- triplepat.com triplepat.net
- a.triplepat.com a.triplepat.net
- b.triplepat.com b.triplepat.net
- c.triplepat.com c.triplepat.net
- d.triplepat.com d.triplepat.net
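For illustration, here is the shape of a check-in from a Go client against the servers listed above. The UUID is made up, GET is an assumption (the HTTP method is not specified here), and the body of the getlastcheckin response is printed as-is rather than parsed.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	uuid := "123e4567-e89b-12d3-a456-426614174000" // example UUID only

	// Check in. Any server in the list works; they are master-master.
	resp, err := http.Get("https://a.triplepat.com/api/v1/checkin/" + uuid)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	fmt.Println("checkin status:", resp.Status)

	// Ask a different server for the last check-in; any replica can answer.
	resp, err = http.Get("https://b.triplepat.net/api/v1/getlastcheckin/" + uuid)
	if err != nil {
		log.Fatal(err)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	fmt.Println("last checkin:", string(body))
}
```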
The email service is also distributed, because the way SMTP and DNS MX records work means that we can specify a number of email servers and the senders’ MTAs will automatically fail over. We run one email-service container on every mirror. This allows users to check in by emailing {uuid}@checkin.triplepat.com or {uuid}@checkin.triplepat.net.
All our APIs, UUIDs, URLs, and email addresses are case-insensitive.
As much as possible, we store all “brains” and user preferences in the phone
apps. That way, the user doesn’t have to contact us very often except for
checkins. The checkin server has no knowledge of users or userids. It just
stores (UUID, timestamp)
pairs. We never trust user clocks, so when users
perform a checkin, we store a timestamp generated by the server rather than one
from the user’s phone. Our servers’ clocks are synced by our cloud providers.
We trust them more than we trust user clocks.
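A sketch of what that means on the server side (names are illustrative, not the actual handler): the timestamp is generated on the server, and nothing from the client other than the UUID is consulted.

```go
package api

import (
	"net/http"
	"strings"
	"time"
)

// Store is whatever records (UUID, timestamp) pairs; in our case it is the
// SQLite-backed store described in the Storage section.
type Store interface {
	Checkin(uuid string, seen time.Time)
}

// CheckinHandler records a check-in for the UUID in the URL path. The
// timestamp comes from the server's clock; the request body and any
// client-supplied time are ignored.
func CheckinHandler(s Store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// UUIDs are case-insensitive, so normalize before storing.
		uuid := strings.ToLower(strings.TrimPrefix(r.URL.Path, "/api/v1/checkin/"))
		s.Checkin(uuid, time.Now().UTC()) // real code gets time from the clocks package (see Time)
		w.WriteHeader(http.StatusOK)
	}
}
```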
Defaults
We use default, boring technology whenever possible. Every time we are clever, we should know the reason why.
This means we use:
- SQLite for the checkin-service database.
- PostgreSQL for the user-service database.
- Tailscale for internal networking.
- Letsencrypt for SSL/TLS certificates.
- cert-manager to manage them on each host.
- nginx for our public web server and reverse proxy.
- Docker for containerization.
- Docker Hub for our container images.
- Prometheus for monitoring and metrics.
- Grafana for monitoring dashboards.
- Github for our source code management.
- Github Actions for CI/CD.
- Go for our server programming language.
- Kotlin and Swift for our mobile programming languages.
- AWS and GCP for our cloud providers, and we only use tiny vanilla machines to avoid lock-in. We also use TILAA to avoid lock-in to “just the big providers”.
- Godaddy for our DNS provider for triplepat.com
- CloudFlare for our DNS for triplepat.net
- Hugo for our static website generator.
We also use every available linter and formatter to make sure our code, configs, and output are internally clean and consistent and align with the expectations of the outside world. Every lint rule we use is one less rule we need to remember, and human brain space is precious.
Reliability
We acquire reliability via the “cockroach” strategy of many cheap replications around the world. We run multiple instances and we run instances in multiple clouds. To do this, we need to make sure the computers we rent in any given cloud are not too expensive. We also need to monitor our systems because “if you aren’t monitoring, you don’t have uptime”.
Efficiency
We keep our memory requirements low, because the cost of running a single server directly translates to how many servers we can realistically run, and more servers means more reliability. LOCKSS (Lots Of Copies Keep Stuff Safe) is the order of the day. Right now our services collectively (nginx + Alloy + checkin-service) run fine on 1GB of RAM, which means we pay under $20/server/month. Going to 2GB doubles the per-server cost, 4GB quadruples it, and so on. For services that don’t need much CPU, RAM is the cost driver, with network usage lurking behind it. Our cockroach strategy means we need to respect RAM usage.
Storage
We use PostgreSQL for the centralized user-service
data. This allows for
better concurrency and scalability compared to file-based databases, which is
important for the user service’s role.
For the distributed checkin-service, we use SQLite. It is simple, small, allows us to store data in a single file per instance, and we don’t have to write our own serialization code. The synchronization between checkin-service instances is handled at the application level, not the database level.
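A sketch of the kind of schema and write path this implies. Table and column names are illustrative, and the import assumes the common mattn/go-sqlite3 driver; the important part is that the MAX() in the upsert is the entire conflict-resolution story.

```go
package storage

import (
	"database/sql"
	"time"

	_ "github.com/mattn/go-sqlite3" // assumed driver; any database/sql SQLite driver works
)

const schema = `
CREATE TABLE IF NOT EXISTS checkins (
  uuid TEXT PRIMARY KEY,
  last_checkin_seconds INTEGER NOT NULL  -- unix seconds, server clock
);`

// RecordCheckin keeps only the most recent timestamp per UUID. Because the
// upsert takes the MAX of the old and new values, replicas can replay each
// other's writes in any order and still converge.
func RecordCheckin(db *sql.DB, uuid string, seen time.Time) error {
	_, err := db.Exec(`
		INSERT INTO checkins (uuid, last_checkin_seconds) VALUES (?, ?)
		ON CONFLICT(uuid) DO UPDATE SET
		  last_checkin_seconds = MAX(last_checkin_seconds, excluded.last_checkin_seconds)`,
		uuid, seen.Unix())
	return err
}
```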
We avoid using database-specific features where practical to retain flexibility. We use PostgreSQL for the user service because we need to be able to write ad-hoc queries over our set of users, so we need the storage to be safely readable and writable by other processes.
Logging
We log using slog. Because of its flexibility, we can swap out backends if we need to use someone else’s log processing software (e.g. Logrus and the like), but outside of main(), things are logged to slog or they aren’t logged. Use the slog global logger, or derive your custom logger from it; every main function should set it up with good settings.
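A minimal sketch of what “set it up with good settings” can look like in a main(); the handler choice and level here are illustrative, not a mandated configuration.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Configure the global logger once, here, and nowhere else.
	logger := slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{
		Level:     slog.LevelInfo,
		AddSource: true, // include file:line, useful when reading logs from many replicas
	}))
	slog.SetDefault(logger)

	// Everything outside main() just uses the global logger (or a child of it).
	slog.Info("checkin-service starting", "listen", ":8080")
}
```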
Monitoring and Metrics
We set things up to be monitored with Prometheus. Its pull-based scraping model makes the most sense for us because it means that Alloy can choose how much and how often to scrape, and we can tune things to make sure we don’t spend more than we intend by sending too much to Grafana. Also, pull-based semantics force us to generate metrics that are sampleable, like counts, instead of ephemeral metrics like gauge values. It is appropriate to force the developer to choose the right metric type, because they are the one who knows best what is being measured.
Strongly prefer Counters and Histograms for metrics instead of Gauges. Prometheus works by sampling metrics, and sampling a counter or histogram coarsens data but sampling a gauge loses data.
Every metric name should lint correctly with promlint.
Every metric name should also end with the unit of measure or _total. This helps us build dashboards and combine values in a sensible manner. Use base SI units when possible (“meters” instead of “kilometers”), and strongly prefer counts to gauges.
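For example, using the standard Prometheus Go client; the metric names here are illustrative, but they follow the rules above (promlint-clean, _total suffix on counters, base SI units).

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// A counter: the name ends in _total and the value only ever goes up, so a
// scrape at any interval can be turned into a rate without losing data.
var CheckinsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "checkin_requests_total",
	Help: "Number of check-in requests received.",
})

// A histogram: the name carries the base SI unit (seconds, not milliseconds).
var CheckinDurationSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "checkin_request_duration_seconds",
	Help:    "Time spent handling check-in requests.",
	Buckets: prometheus.DefBuckets,
})
```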
Docker
Every binary we run in production is run in a Docker container. This means each of them needs a Dockerfile in the same directory as its main() function. Please use checkin-service/Dockerfile as an example to get you started. If you do it right, then you can keep compilation times low while still getting the benefits of in-container unit testing.
Every Dockerfile should do one thing. Containers are not VMs and should not be used like VMs. We enforce this by using distroless images where we can and only falling back to Alpine-linux images when we need ad-hoc debug tooling.
Run one process per container, and try to invoke that process without forcing the shell to parse a string to run it (use the array form of ENTRYPOINT). The process should run as nobody, or we should have a good reason why it can’t.
Use ENTRYPOINT instead of CMD in the Dockerfiles, because our Docker images are meant to hold and run a single binary.
Build and deploy in different containers using multi-stage builds. Among other things, this keeps the deployed size of our images quite small (under 20MB). Deploying a compiler alongside every service is not only wasteful of disk space and bandwidth, it also needlessly adds complicating pieces to our images. We keep things simple by removing needless complications.
Every Dockerfile should have a HEALTHCHECK directive in it. This allows deployment to stall until a container is actually working instead of treating container invocation as a “fire and forget” operation. There is a healthcheck directory containing helpful libraries and a healthcheck/check-health binary to perform HTTP-based health checks (like the ones provided by the healthcheck library). Also, the person building the service knows best what the healthcheck should be, so it is appropriate to make the Dockerfile author design it, and not the deployer.
Each Dockerfile should, as much as possible, start with exactly the same text as the others. In particular, everything up through the running of all the unit tests should be the same. This allows us to only run the unit tests once when we build all our containers, because docker build can use the cache hit from testing the first to skip the tests for the second.
Health checks
Every daemon we run in a container should support health checks. The simplest implementation of health checks can be found in our internal healthcheck library. It contains a server that reports the daemon’s health, a binary that connects to that server and reports the health back to Docker via its exit code, and a simple interface for things that want to be checked.
All of our docker images should have health checks, one way or another.
It is easy to add a healthcheck to your service by implementing the healthcheck.CheckableService interface, which has one method. You should have a health check for every service, and pass all such services to the health check server when you start it in main().
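For illustration only: the real interface lives in the healthcheck package, and the method name and signature below are assumptions standing in for whatever that single method actually is.

```go
package health

import "errors"

// CheckableService is a stand-in for healthcheck.CheckableService; the real
// interface has one method, whose exact name and signature live in the
// healthcheck package. The shape below is an assumption.
type CheckableService interface {
	Healthy() error // nil means healthy
}

// CheckinService's health check verifies the one thing it cannot work
// without: an open database handle.
type CheckinService struct {
	dbOpen bool
}

func (s *CheckinService) Healthy() error {
	if !s.dbOpen {
		return errors.New("database is not open")
	}
	return nil
}
```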
We should open-source this small library. It is so useful and good.
Test coverage
All code should be 100% covered by unit tests. Getting uncovered code back to being covered should be considered a high-priority task. 100% coverage doesn’t guarantee no bugs, but it does guarantee that every line of code can be executed without crashing in at least one context. Also, the difference between 100% and 99% is psychologically much greater than the difference between 99% and 98% or 79% and 78% — once you start allowing coverage to slip, it’s hard to get it back up and easy to let it get worse. Beyond simply executing every line, striving for complete coverage encourages writing code that is inherently more modular and designed to be testable. “Keep it 100” and cover everything to make sure things stay working and the design stays good.
Linting
All code is linted with golangci-lint. All code has to lint clean, and this is enforced by our Github Actions. You can put exceptions in with comments, but the exceptions should be few and well-motivated. If we find that a particular linter run by golangci-lint is requiring lots of exceptions, we should decide to either turn that linter off or comply with it everywhere.
All prometheus metrics and Dockerfiles are linted too!
Because we don’t have a big team and are often working under time-pressure and “just trying to get it to work”, we want to use every linter possible. Everything that can help us offload “smell tests” to automated machines is a good idea.
Error handling and recovery
Feel free to use rtx.Must and rtx.ValueOrDie liberally for errors that are genuinely unrecoverable or indicate a programming bug (e.g., parsing a known-good embedded asset). Crashing immediately in these cases makes the problem obvious and prevents potentially corrupted state. These helpers also simplify testing, as you don’t need to write tests for error-handling paths that simply crash.
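For example, the following sketch uses a local must helper because the exact rtx signatures are not shown in this document; real code would call rtx.Must or rtx.ValueOrDie directly, and the embedded asset is simulated with an inline byte slice.

```go
package config

import (
	"encoding/json"
	"fmt"
)

// must mirrors the role rtx.Must plays for us: crash immediately, with
// context, on errors that can only mean a programming or build mistake.
// (The real rtx signatures may differ; this shows the shape of the idea.)
func must(err error, msg string) {
	if err != nil {
		panic(fmt.Sprintf("%s: %v", msg, err))
	}
}

// defaultsJSON stands in for a known-good embedded asset.
var defaultsJSON = []byte(`{"checkin_interval": "24h"}`)

type Defaults struct {
	CheckinInterval string `json:"checkin_interval"`
}

// LoadDefaults has no error return: if the embedded asset does not parse,
// the binary was built wrong, and crashing is the obvious, testable behavior.
func LoadDefaults() Defaults {
	var d Defaults
	must(json.Unmarshal(defaultsJSON, &d), "embedded defaults are corrupt")
	return d
}
```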
All errors must be explicitly handled. Common handling strategies include:
- Crashing: Using rtx.Must or rtx.ValueOrDie for unrecoverable errors, as described above.
- Returning the error: Passing the error up the call stack for a higher-level function to handle. This is the standard Go approach and is perfectly acceptable.
- Handling the error: Implementing specific logic to recover from the error or take alternative action within the current function.
- Logging the error: If an error cannot be handled appropriately by the current function (and returning it is not suitable, perhaps because it’s in a background goroutine or at the top level of a request handler), use slog.Error(...) to record it. This increments a Prometheus counter, making unhandled operational errors visible.
Avoid both logging an error and then returning the same error, as this often leads to duplicate log entries. Choose one appropriate handling strategy for each error and stick to it. If you can handle the error, then it likely should cause a Warn or Info log message, because it’s not a true error, just something unexpected. Casually logging at the Error level causes alert spam and operator fatigue.
Command line arguments
We build our Docker images with ENTRYPOINT set, which means we pass in command-line arguments in list form as part of the command: directive in the appropriate YAML file.
Operations
We use GitOps as our operations and deployment strategy. When you push a new tag to the go repo, an image with that tag is built and pushed to Docker Hub. We use Github Actions to build and push the images.
Code repos run all tests and linters on every push, and build and push new
containers on every tag. Config repos push the configs to production on every
push or merge to main
. We acknowledge that code is configs and configs are
code, so the line is fuzzy, but the categorization in practice should be clear
in each case.
Push to prod on merge to main:
- server-configs
- website

Build+Lint on merge, push on tag:
- go
- ios_app
- android_app
We keep our configurations, as much as possible, in the server-configs repo, deployed as the compose.yml file that is run by docker compose. If we move to k8s, then it will be a different file name, but the principle is the same: one file to organize them all.
Deployments of containers always use --wait to ensure that all the containers are healthy before moving on to the next machine. In this way, bad configs that break containers at start only take out one machine.
We rollback just like we roll forward: push and merge to main.
Security
We use letsencrypt for SSL/TLS certificates. We use cert-manager to manage them on each host. Each host manages its own certificates because then we don’t have to worry about running centralized secret provisioning. All our public services are behind nginx, which terminates SSL/TLS and proxies requests to the appropriate internal service.
Our internal-only services are behind Tailscale. We only share internal services between machines over our Tailscale network.
In general, we prefer a “fail2slowdown” strategy instead of an “ostrich” or “fail2ban” strategy. We want to slow down attackers, but we also want to keep our users happy. We want to be able to detect and respond to attacks while continuing to serve normal traffic. Public services should be wrapped in a handler that slows down responses to IPs that have sent us erroneous requests. This prevents us from being overwhelmed by bad requests and prevents Rumpelstiltskin attacks. That said, people developing against our service will often make lots of bad requests before figuring out how to make the good ones they want to use, so we don’t want to ban them.
Each bad request from an IP causes all responses to that IP to be slower by 1 second for the next two hours, up to a max added delay of 10 seconds.
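A sketch of the shape of that wrapper. The numbers come from the policy above, but the bookkeeping is simplified: the real handler would also evict old entries and get its time from the clocks package described below.

```go
package slowdown

import (
	"net"
	"net/http"
	"sync"
	"time"
)

const (
	penaltyPerBadRequest = 1 * time.Second
	maxPenalty           = 10 * time.Second
	penaltyWindow        = 2 * time.Hour
)

type entry struct {
	badRequests int
	lastBad     time.Time
}

// Handler delays responses to IPs that have recently sent bad requests:
// one extra second per bad request in the last two hours, capped at ten.
type Handler struct {
	mu      sync.Mutex
	penalty map[string]entry
	next    http.Handler
}

func New(next http.Handler) *Handler {
	return &Handler{penalty: map[string]entry{}, next: next}
}

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	ip, _, _ := net.SplitHostPort(r.RemoteAddr)

	h.mu.Lock()
	e := h.penalty[ip]
	if time.Since(e.lastBad) > penaltyWindow {
		e = entry{} // penalties expire after the window
	}
	delay := time.Duration(e.badRequests) * penaltyPerBadRequest
	if delay > maxPenalty {
		delay = maxPenalty
	}
	h.mu.Unlock()

	time.Sleep(delay) // slow down, don't ban

	rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
	h.next.ServeHTTP(rec, r)

	if rec.status >= 400 { // a bad request earns the IP another second
		h.mu.Lock()
		e = h.penalty[ip]
		e.badRequests++
		e.lastBad = time.Now()
		h.penalty[ip] = e
		h.mu.Unlock()
	}
}

// statusRecorder remembers the status code so we know whether the request was bad.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}
```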
We also limit the number of simultaneous connections (good or bad) from a single endpoint, both in our nginx config and in our Go code.
We add no-index and no-cache headers to all API responses.
Networking
Internal services are not exposed to the public internet, and are only exposed over Tailscale connections. We use Tailscale for all internal networking between hosts. Within a host, we use docker-compose to run services and it sets up a virtual internal network for them.
We use Godaddy for triplepat.com DNS (both registration and serving). We use Cloudflare for triplepat.net DNS (both registration and serving). We have no CDN and do not want one.
We have matching records for DNS on two different TLDs. We have www., @, MX, a., b., c., and d. records for triplepat.net and triplepat.com. Those TLDs are run by two different providers.
Website
We use Hugo to generate our website. Our website is static. Comments are provided by linking to BlueSky posts.
Time
Programming using time is often terrible for testing and a breeding ground for subtle bugs. As much as possible, we use the clocks package to provide a clock that is easy to mock in tests. We use clocks.System for all time-related operations in production, and we use clocks.Static or clocks.Blocking for all time-related operations in tests. We use the .Sleep(), .Now(), and .After() methods associated with those clocks instead of time.Sleep().
Ideally, time.Now() and time.Sleep() (and the helpers in time that consult the current time, like Until() and Since()) should never appear in the codebase outside of the clocks package and _test.go files.
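For illustration, here is the shape of code written against such a clock. The interface below is a stand-in for whatever the clocks package actually exports; its method set matches the methods named above, and the function itself is a hypothetical helper, not real code.

```go
package scrub

import "time"

// Clock is a stand-in for the interface satisfied by clocks.System,
// clocks.Static, and clocks.Blocking; the exact definition lives in the
// clocks package.
type Clock interface {
	Now() time.Time
	Sleep(d time.Duration)
	After(d time.Duration) <-chan time.Time
}

// StaleUUIDs returns the UUIDs whose last check-in is older than maxAge.
// Taking the clock as a parameter is what makes this trivially testable
// with a static clock instead of real wall time.
func StaleUUIDs(c Clock, lastSeen map[string]time.Time, maxAge time.Duration) []string {
	var stale []string
	cutoff := c.Now().Add(-maxAge)
	for uuid, seen := range lastSeen {
		if seen.Before(cutoff) {
			stale = append(stale, uuid)
		}
	}
	return stale
}
```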
Memoryless
For jobs that need to happen repeatedly, we want to prevent synchronization. The best way of preventing synchronization is to ensure that the events occur in a Poisson process: its exponentially distributed waits are memoryless, so independent instances cannot drift into lock-step. For that, we created memoryless.Ticker{}, which has one method, Tick(), which produces a <-chan time.Time just like time.Tick().
Production code should use memoryless.Ticker.Tick instead of time.Tick or time.NewTicker, or it should have a good explanation as to why not.
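This is not the real memoryless package, just a sketch of the underlying idea: waits drawn from an exponential distribution make the resulting events a Poisson process, so replicas never fall into lock-step. Any bounds the real Ticker may apply are omitted here.

```go
package jitter

import (
	"math/rand"
	"time"
)

// tick sends the current time at exponentially distributed intervals with
// the given mean, which is what makes the resulting events a Poisson process.
func tick(mean time.Duration) <-chan time.Time {
	ch := make(chan time.Time)
	go func() {
		for {
			wait := time.Duration(rand.ExpFloat64() * float64(mean))
			time.Sleep(wait) // the real code would use the clocks package
			ch <- time.Now()
		}
	}()
	return ch
}
```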
Subscriptions
Users log in to our services using “Log in with Apple” or “Log in with Google”. This causes us to create an account for them if they don’t already have one. They then get our free tier of service. In the free tier, you get 3 UUIDs, and your UUIDs get scrubbed and deleted every month. If you want UUIDs to never get deleted, you can subscribe to our paid plan.
If you have a subscription, then you get UUIDs that are not scrubbed and are yours forever.
We have a web-hook in user-service to accept notifications from the Apple App Store about new subscriptions. That web-hook updates the state in our DB, and it is the DB state that we will use for our scrubbing utility. The iPhone app also reports its subscription state and supports sending the purchase receipt to us for verification, so we try to make sure we never do harm to an existing customer.
Summary
Keep it simple, respect CS theory, test everything, lint everything, and use automation everywhere. That’s how we built it.