Runtime dependencies

We run a SaaS business for software reliability, which means we want others to depend on us. In order for others to depend on us, we need a demonstrated record of uptime and reliability, and we need to be clear about what we depend on at runtime to deliver that service.

In our keep it boring post, we listed our dependencies. Some of those were services, and some were libraries. Among the services, some we depend on to work at least once a month, and some we depend on to work basically all the time. As we get closer to launch, it's a good time to revisit our dependencies, be explicit about them, and check whether we're happy with the current state of affairs.

Must work all the time: at least one cloud provider

If all of our cloud providers go down in multiple regions simultaneously (that’s 5 machines spread across AWS, GCP, and tiny EU-based TILAA), Triple Pat is down and we can’t do anything about that. That said, it seems vanishingly unlikely that they would all go down simultaneously, and our cockroach-like strategy of having each server be a master DB means that we can survive as long as at least one cloud provider is up.

The critical thing here is that our dependency on cloud providers is an OR, not an AND. If one cloud provider goes down, we are still up. If two of them go down, we are still up. If four regions across three providers go down, we are still up. All we need is at least one region of at least one cloud provider to be up.
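To make the OR concrete: the service is only down if every provider is down at once, so the combined downtime probability is the product of the individual ones. A quick sketch, using made-up uptime figures purely for illustration (these are not real SLA numbers):

```python
# OR-of-providers availability: we are down only if ALL providers are down.
# The per-provider uptime figures below are illustrative, not real SLAs.
uptimes = {"AWS": 0.999, "GCP": 0.999, "TILAA": 0.99}

p_all_down = 1.0
for name, up in uptimes.items():
    p_all_down *= (1 - up)  # independent-failure assumption

availability = 1 - p_all_down
print(f"P(all providers down): {p_all_down:.1e}")       # ~1e-8
print(f"Combined availability: {availability:.10f}")
```

Even with one provider at only two nines, the combined downtime probability is tiny, which is the whole point of the OR. (The independence assumption is the weak spot: correlated failures, like a BGP incident, are why "vanishingly unlikely" is a hedge and not a guarantee.)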

Must work almost all the time: at least one DNS provider

DNS is scary because it's always DNS, and DNS is a critical dependency of the Internet. We use GoDaddy as our registrar and DNS provider for triplepat.com, and its competitor Cloudflare as our registrar and provider for triplepat.net. Our phone apps will try names from both domains before giving up, and so will all check-ins that follow best practices. If one DNS provider goes down, any system without a cached record for that domain can no longer resolve that name, but it can still reach our systems through the other domain, as long as at least one of our DNS providers is up.

We can survive a few minutes of DNS downtime every now and then thanks to DNS caching. We have two independent names using two different registrars and providers, so it would have to be a pretty major disaster to take both of them down for more than a few minutes.
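The client-side fallback described above can be sketched as a loop over domains. This is a minimal illustration, not Triple Pat's real client code: the endpoint path is hypothetical, and the fetch function is injected so the fallback logic is shown on its own.

```python
# Two independent domains backed by different registrars and DNS providers.
DOMAINS = ["triplepat.com", "triplepat.net"]

def check_in(token, fetch):
    """Try each domain in turn; return the domain that worked, else None.

    fetch(url) should return an HTTP status code, or raise OSError on a
    DNS-resolution or network failure. The /checkin path is hypothetical.
    """
    for domain in DOMAINS:
        url = f"https://{domain}/checkin/{token}"
        try:
            if fetch(url) == 200:
                return domain  # this domain resolved and responded
        except OSError:
            continue  # DNS or network failure: fall through to the next name
    return None  # every domain failed: both DNS providers would have to be down
```

The useful property is the same OR as with cloud providers: a check-in only fails to arrive if both domains are unresolvable at the same time.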

Must work pretty well: Tailscale

We use Tailscale for our internal network, to let our DBs sync their records. Tailscale has a good design: even if the Tailscale coordination servers go down for a while, it is highly likely that our internal traffic will keep flowing. This means we can survive periods of Tailscale downtime without worry; their robustness to their own failure is part of the reason we use them! We don't expect a multi-day outage, but if Tailscale does go down for an extended period, we have a manual failover process we can use to route around their service.

Must work at least once every 3 months: Let's Encrypt

We use Let's Encrypt to get SSL certificates for our domains, and we renew them regularly. Without a valid certificate we can't serve HTTPS traffic, so this is a critical dependency. That said, the certificates are valid for 3 months and we renew them at least weekly, so we can survive a couple of months of Let's Encrypt downtime. If Let's Encrypt disappears, we have months to find a replacement before our users notice a problem.
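The arithmetic behind that slack is worth spelling out. The 90-day lifetime and weekly renewal cadence come from the post; the helper below is our own illustrative sketch, not Triple Pat's renewal tooling:

```python
from datetime import date, timedelta

CERT_LIFETIME = timedelta(days=90)  # Let's Encrypt certs are valid for ~3 months
RENEW_EVERY = timedelta(days=7)     # renewal is attempted at least weekly

def days_of_slack(issued: date, today: date) -> int:
    """Days Let's Encrypt could stay down before the current cert expires."""
    return ((issued + CERT_LIFETIME) - today).days

# Because renewal runs weekly, the deployed cert is at most ~7 days old,
# so even in the worst case there are roughly 83 days of slack:
worst_case_slack = CERT_LIFETIME.days - RENEW_EVERY.days
```

In other words, a Let's Encrypt outage only becomes user-visible if it lasts nearly the full certificate lifetime, which is why this dependency only needs to work "at least once every 3 months."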

That’s all!

In steady state, those are all of our runtime dependencies for check-ins. As a company we rely on more services, but those don't directly affect users' check-ins.

The bottom line is that for our users (you!), as long as at least one cloud provider is working and at least one DNS provider is working, then our service, our app, and your check-ins should all work just fine. Anything else being down might inconvenience us, but you should be fine.

Contribute to the discussion by replying on BlueSky!