Reliable DNS

The Triple Pat check-in service helps people ensure that their software operations alerts are working. This means that we are measuring the reliability of something that is, itself, very reliable. Just as you need a lot of precision to measure something that is very precisely made, you need a lot of reliability to measure a reliable system. Every use of our APIs uses the names of our servers, which means that our system is only as reliable as our names are, and our names are provided by DNS.

The Domain Name System (DNS) seems simple in concept: a database mapping names to IP addresses. Unfortunately, there are only two hard things in computer science: cache invalidation and naming things. And DNS is a distributed cached database of names.

This means that, despite superb efforts all around, DNS doesn’t work right sometimes. It’s always in some sense operator error, but those errors happen often enough that we can also blame the tool for providing a gun and aiming it at the operator’s foot. DNS is so often the culprit that we even have websites like isitdns.com and dnshaiku.com. However, at Triple Pat we want to build a reliable system that provides an external API. That API can be accessed in two main ways: by email and by URL. Both methods depend on DNS. So how do we set up our DNS for reliability?

Well, first we have reliability through redundancy. Everything available through triplepat.com is also available through triplepat.net. We have mirrored the records so that if DNS seems down for one name, you can try the other, which should still be up. We enforce the independence of the two by buying the names and hosting the DNS through rival companies. GoDaddy is in charge of registration and DNS operations for triplepat.com, and Cloudflare is in charge of registration and DNS operations for triplepat.net. These two companies are direct competitors, and as such they are unlikely to share infrastructure. So we have achieved independent, redundant reliability via incentive-compatibility with large corporations.
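On the client side, the "try the other name" advice can be sketched roughly as follows. This is a minimal illustration, not our client library: the function names and the request path are hypothetical, and the only things taken from the text above are the two mirrored names.

```python
import urllib.request

# The two mirrored names from the text; both are meant to serve the same content.
MIRRORS = ("triplepat.com", "triplepat.net")


def http_get(host, path, timeout=5.0):
    """Fetch https://<host><path>. Raises OSError on DNS or connection failure."""
    with urllib.request.urlopen(f"https://{host}{path}", timeout=timeout) as resp:
        return resp.read()


def fetch_with_failover(path, hosts=MIRRORS, fetch=http_get):
    """Try each mirror in order and return the first successful response.

    `fetch` is injectable so the failover logic can be exercised offline.
    """
    last_error = None
    for host in hosts:
        try:
            return fetch(host, path)
        except OSError as err:
            # urllib's URLError (which wraps DNS lookup failures) and socket
            # timeouts are both OSError subclasses, so one clause covers them.
            last_error = err
    raise ConnectionError(f"all mirrors failed for {path!r}") from last_error
```

A caller would use something like `fetch_with_failover("/some/path")`, where the path is a placeholder for whatever URL you were given. Note that this treats any HTTP error as a reason to try the next mirror, which is a deliberate simplification.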

Next, for our mail operations to uuid@checkin.triplepat.com and uuid@checkin.triplepat.net, we have MX-type DNS records specifying the mail servers for both checkin.triplepat.net and checkin.triplepat.com. Because we can receive email at any of our mirrors, we make the MX record for each TLD refer to every one of:

a.triplepat.com, b.triplepat.com, c.triplepat.com, d.triplepat.com, a.triplepat.net, b.triplepat.net, c.triplepat.net, d.triplepat.net

and we have made the time-to-live (TTL) for the MX records as long as each provider allows. This means that email service is highly likely to keep working even if one TLD goes down: the long-lived records say that mail for one TLD can be handled by servers under the other, and that fact can be served out of cache. So, once you have emails flowing, it is highly likely that they will continue to flow even through DNS disruptions.

So we achieve DNS reliability by having multiple providers, and, for the MX (mail) records, by having long-lived cache entries that allow each TLD to back up the other (or not). To see this in action for our systems, use the dig tool!

$ dig checkin.triplepat.net MX
...
;; ANSWER SECTION:
checkin.triplepat.net.	86400	IN	MX	10 a.triplepat.net.
checkin.triplepat.net.	86400	IN	MX	10 c.triplepat.com.
checkin.triplepat.net.	86400	IN	MX	10 b.triplepat.com.
checkin.triplepat.net.	86400	IN	MX	10 a.triplepat.com.
checkin.triplepat.net.	86400	IN	MX	10 b.triplepat.net.
checkin.triplepat.net.	86400	IN	MX	10 d.triplepat.net.
checkin.triplepat.net.	86400	IN	MX	10 d.triplepat.com.
checkin.triplepat.net.	86400	IN	MX	10 c.triplepat.net.

and

$ dig checkin.triplepat.com MX
...
;; ANSWER SECTION:
checkin.triplepat.com.	86400	IN	MX	10 a.triplepat.com.
checkin.triplepat.com.	86400	IN	MX	10 c.triplepat.com.
checkin.triplepat.com.	86400	IN	MX	10 d.triplepat.com.
checkin.triplepat.com.	86400	IN	MX	10 a.triplepat.net.
checkin.triplepat.com.	86400	IN	MX	10 b.triplepat.com.
checkin.triplepat.com.	86400	IN	MX	10 b.triplepat.net.
checkin.triplepat.com.	86400	IN	MX	10 d.triplepat.net.
checkin.triplepat.com.	86400	IN	MX	10 c.triplepat.net.

From this we can see both sets of names in both responses. Our DNS entries for a, b, c, and d are shorter-lived because we want to be able to migrate servers in less than a day, but the MX records mean that the mail sender can fail over between TLDs transparently to you, the service user.
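If you would rather check this mechanically than by eye, a small script can parse the answer lines and confirm that each record set names hosts under both TLDs. The sketch below works on pasted dig output rather than running dig itself; the function names are our own invention for illustration, and the sample is a subset of the lines shown above.

```python
def mx_targets(dig_answer):
    """Map each owner name to the set of MX target hosts found in dig output."""
    targets = {}
    for line in dig_answer.splitlines():
        fields = line.split()
        # An MX answer line looks like: <owner> <ttl> IN MX <preference> <target>
        if len(fields) == 6 and fields[2] == "IN" and fields[3] == "MX":
            owner = fields[0].rstrip(".")
            target = fields[5].rstrip(".")
            targets.setdefault(owner, set()).add(target)
    return targets


def covers_both_tlds(hosts, suffixes=(".com", ".net")):
    """True if at least one host ends with each of the given suffixes."""
    return all(any(h.endswith(s) for h in hosts) for s in suffixes)


SAMPLE = """\
;; ANSWER SECTION:
checkin.triplepat.net.	86400	IN	MX	10 a.triplepat.net.
checkin.triplepat.net.	86400	IN	MX	10 c.triplepat.com.
checkin.triplepat.com.	86400	IN	MX	10 a.triplepat.com.
checkin.triplepat.com.	86400	IN	MX	10 b.triplepat.net.
"""

for owner, hosts in sorted(mx_targets(SAMPLE).items()):
    print(owner, "covers both TLDs:", covers_both_tlds(hosts))
```

Lines that are not MX answers (comments, blank lines, the `...` we elided) are simply skipped.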

No matter which TLD you use for the email, as long as the MX record is in the sender's cache (and to get into the cache, the lookup needs to have worked once in the past 24 hours), your email can be routed to a mail server under either of the two TLDs.
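The reasoning here can be written down as a toy model. The sketch below is not how any real resolver or mail server is implemented; it assumes, conservatively, that the shorter-lived address records for the individual hosts are never cached, and that a DNS zone that is "down" answers no queries at all. All names in it are hypothetical.

```python
MX_TTL = 86400  # seconds; the MX TTL shown in the dig output (24 hours)


def zone_of(name):
    """Last two labels of a name, e.g. 'a.triplepat.net' -> 'triplepat.net'."""
    return ".".join(name.rstrip(".").split(".")[-2:])


def reachable_mail_hosts(recipient_domain, mx_hosts, mx_age_seconds,
                         zones_down, mx_ttl=MX_TTL):
    """Mail hosts a sender could still find during a DNS outage.

    mx_age_seconds is how long ago the sender cached the MX set,
    or None if it never has.
    """
    mx_cached = mx_age_seconds is not None and mx_age_seconds < mx_ttl
    if not mx_cached and zone_of(recipient_domain) in zones_down:
        return []  # the sender cannot learn the MX set at all
    # Each host still needs its own address lookup, which must succeed now.
    return [h for h in mx_hosts if zone_of(h) not in zones_down]


HOSTS = [f"{letter}.triplepat.{tld}" for tld in ("com", "net") for letter in "abcd"]

# MX cached an hour ago, triplepat.com DNS down: the .net hosts still work.
print(reachable_mail_hosts("checkin.triplepat.com", HOSTS, 3600, {"triplepat.com"}))
```

With a cache entry older than the TTL and the recipient's zone down, the same call returns an empty list, which is the "it needs to have worked once in the past 24 hours" caveat.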

Even better, it is highly likely that at least one TLD is functioning properly, because the providers are independent. This means that if either one of them is functioning correctly, the service should still work. If we find out that two are not enough, then we can easily add a third TLD, but for now we feel pretty confident that two independent DNS registrars/providers are enough to get us the nines we need to provide a reliable service.
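To put rough numbers on "the nines we need": if provider failures really are independent, the chance that all of them are down at once is the product of the individual chances. The availability figure in this sketch is purely hypothetical, not a measurement of GoDaddy, Cloudflare, or anyone else.

```python
import math


def combined_availability(availabilities):
    """Availability of a system that works if any one independent provider works."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= 1.0 - a
    return 1.0 - p_all_down


def nines(availability):
    """Number of nines, e.g. 0.999 is about 3. Returns infinity for 1.0."""
    if availability >= 1.0:
        return math.inf
    return -math.log10(1.0 - availability)


single = 0.999  # hypothetical "three nines" for one provider
for count in (1, 2, 3):
    combined = combined_availability([single] * count)
    print(f"{count} provider(s): about {nines(combined):.1f} nines")
```

Under the independence assumption each additional provider adds its own nines to the total, which is why a third TLD is the obvious next step if two turn out not to be enough. Correlated failures (for example, a shared upstream dependency) would make the real figure worse than this estimate.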
