Learning in public from our incidents

Triple Pat runs a service on computers and offers that service to the world. Computers only mostly work, but we want our service to work all the time, so we have to assemble a reliable service out of mostly-reliable components. A critical part of doing that is a post-mortem process in which past incidents are analyzed and steps are taken to prevent them from recurring, so that’s what we do!

Triple Pat runs its post-mortem process largely in the open. We have a monitoring process (run by Better Stack), and they host a public status page for us showing the server statuses they measure. The status page can be found at status.triplepat.com. On that main status page, you can see past incidents where some servers weren’t behaving as well as we would have liked.

If you would like to learn more about each incident, you can check out our page of incident reports! Every outage is categorized and described. Some are just acknowledged, but most end up causing changes in how we deploy and run our systems.

From the status page, you can see not only that our service is reliable (because we operate in master-master mode, an outage would require every server to be down simultaneously, and that has never happened), but also that as things happen to individual service components, we learn and take action to prevent repeats.
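
To make that concrete, here’s the back-of-the-envelope math behind master-master redundancy. The numbers below are purely illustrative (they are not our measured uptime figures or our actual server count), and the calculation assumes servers fail independently:

```python
# Hypothetical numbers: suppose each of 3 independent servers is up 99% of the time.
# In a master-master setup, the service is only down when *all* servers are down at once.
per_server_uptime = 0.99   # illustrative, not a measured figure
servers = 3                # illustrative fleet size

p_all_down = (1 - per_server_uptime) ** servers
service_uptime = 1 - p_all_down

print(f"P(total outage) = {p_all_down:.6f}")      # 0.000001
print(f"Service uptime  = {service_uptime:.6%}")  # ~99.9999%
```

The point is that the probability of a total outage shrinks multiplicatively with each independent server, which is why a single misbehaving server doesn’t take the whole service down.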

Incidents drive both immediate fixes (like rolling back to a previously-working configuration) and long-term preventions (like adding validation steps that block the rollout of broken configs). Reliability comes from good decisions, good decisions come from experience, and experience comes from learning from bad decisions. You can watch along as our organization learns more about where our systems tend to break and what we can do to prevent and/or automatically mitigate that breakage.
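
As an illustration of what we mean by a validation step, here’s a minimal sketch of a pre-deploy config check. It assumes a JSON config file; the required key names and the `validate_config` helper are hypothetical stand-ins, not our actual deploy tooling:

```python
import json
import sys

# Hypothetical required fields for an example service config.
REQUIRED_KEYS = {"listen_port", "upstream_hosts", "tls_cert_path"}

def validate_config(path: str) -> list[str]:
    """Return a list of problems found in the config file; an empty list means OK to deploy."""
    try:
        with open(path) as f:
            config = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"could not parse {path}: {exc}"]

    problems = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        problems.append(f"missing required keys: {sorted(missing)}")
    if not isinstance(config.get("upstream_hosts"), list) or not config.get("upstream_hosts"):
        problems.append("upstream_hosts must be a non-empty list")
    return problems

if __name__ == "__main__":
    issues = validate_config(sys.argv[1])
    if issues:
        print("refusing to deploy:", *issues, sep="\n  - ")
        sys.exit(1)  # non-zero exit blocks the rollout in CI
    print("config looks sane; proceeding with rollout")
```

Wiring a check like this into the deploy pipeline turns a whole class of incident (“we shipped a config that didn’t parse”) into a failed CI run instead of an outage.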

We’re proud of our current 100% uptime record for the Check-In service, but we’re even more proud that we’re constantly learning and improving. Doing this helps ensure that our uptime isn’t just luck: it’s the result of systematic learning from past incidents and continuous investment in increased resilience.

Contribute to the discussion by replying on BlueSky!