Building Triple Pat

To build a SaaS for software reliability, we need to make sure the system we are building is itself very reliable. Tony Hoare famously said, "There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies."

To make sure we aren't adding to the problem, we continually work to make the system as simple as possible. Along with simplicity, we want to minimize cleverness. Although it feels strange to mix Harry Potter and the ACM Turing Award, the other principle we hew to when constructing this system is "Never trust anything that can think for itself if you can't see where it keeps its brain." So we want to make our systems as dumb as possible, and when they need to be smart, we want to be extra clear about where the brain is.

Our check-in system is simplicity itself. It is a service that tracks (UUID, timestamp) pairs, and the only system that is allowed to say what "now" is, is the server. The internal database maintains the invariant that the timestamp for a given UUID is never allowed to decrease, so merging two copies of a pair just means keeping the later timestamp. That makes the (UUID, timestamp) pairs a CRDT, which means that our databases can safely operate with every independent node as a master, and we are guaranteed eventual consistency.
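To make the "time never decreases" invariant concrete, here is a minimal sketch in Go of a store that merges by keeping the later timestamp. It is an illustration under assumptions, not the actual Triple Pat server code: the names Store, Merge, CheckIn, Last, and MergeFrom are hypothetical, UUIDs are kept as plain strings, and replication between nodes is reduced to an in-process call.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Store maps a check-in UUID (a plain string in this sketch) to the latest
// timestamp seen for it. All names here are hypothetical.
type Store struct {
	mu   sync.Mutex
	last map[string]time.Time
}

func NewStore() *Store {
	return &Store{last: make(map[string]time.Time)}
}

// Merge records ts for id only if it is later than the stored value, so the
// stored time never decreases. It reports whether the store changed.
func (s *Store) Merge(id string, ts time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.last[id]; ok && !ts.After(cur) {
		return false
	}
	s.last[id] = ts
	return true
}

// CheckIn stamps id with the server's own clock; callers never supply a time.
func (s *Store) CheckIn(id string) {
	s.Merge(id, time.Now())
}

// Last returns the latest known check-in time for id, if any.
func (s *Store) Last(id string) (time.Time, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	ts, ok := s.last[id]
	return ts, ok
}

// MergeFrom folds every pair held by another node into s. It copies the
// other node's pairs first so that the two locks are never held together.
func (s *Store) MergeFrom(other *Store) {
	other.mu.Lock()
	snapshot := make(map[string]time.Time, len(other.last))
	for id, ts := range other.last {
		snapshot[id] = ts
	}
	other.mu.Unlock()
	for id, ts := range snapshot {
		s.Merge(id, ts)
	}
}

func main() {
	t0 := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	a, b := NewStore(), NewStore()
	a.Merge("svc-1", t0)
	b.Merge("svc-1", t0.Add(5*time.Minute))
	b.Merge("svc-2", t0.Add(time.Minute))

	// Exchange state in both directions; the order does not matter.
	a.MergeFrom(b)
	b.MergeFrom(a)

	la, _ := a.Last("svc-1")
	lb, _ := b.Last("svc-1")
	fmt.Println(la.Equal(lb)) // both nodes now hold the later timestamp
}
```

Because taking the maximum is commutative, associative, and idempotent, nodes can exchange pairs in any order and any number of times and still end up with the same state, which is what lets every node accept writes.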

The only "brain" of note is the phone app. It repeatedly polls Triple Pat servers (any server in our geographically distributed set) for the last check-in time of each of a user's UUIDs. If that time ever gets too old (the threshold is configurable in the app), the app displays an alert. If the Triple Pat servers are down, the last check-in time will not be updated, so the user will still be notified: our alerting system fails noisily.

By keeping the database conceptually simple (although master-master mode is always a little complex) and by putting all the brains in the phone app, we can guarantee that if there are no alerts from Triple Pat, the last check-in time is within acceptable bounds. If the alarm is firing, then either Triple Pat is broken or the user's service is down. Triple Pat continues to exist and have customers only if we make sure that it is almost always the case that the alarm fires when the user's service is down.

All of this is conceptually simple, but it requires a relatively broad base of knowledge to actually build. We have two servers built in Go (one centralized system for user identity services, and one distributed system that just maintains the (UUID, timestamp) pairs), a website built with 11ty, phone apps built with Kotlin and Swift, and a complicated deployment strategy with servers spread across many providers. Each piece is simple and has a well-defined function, which is the only reason the whole remains tractable.
