Our Mission
Help techies rest easy by providing a "sleeper dike" that is maximally independent of their other systems.
Help techies rest easy...
Running software systems is a very stressful job. We know this because of the high rate of burnout in the industry. If we agree that burnout is a problem, then we can address it in one of two ways: changing the systems so they worry people less, or changing the people so they care less. While we hope that tech workers are caring for their mental health and that their employers support those efforts, we think it is also worthwhile to ask how the systems themselves can be improved to support a sense of ease and confidence that enables healthy sleep.
...by providing a "sleeper dike"...
We are a Dutch company, and the Dutch know what it is like to depend on systems and yet still need to sleep. Millions of people in the Netherlands sleep easy, secure in the dike system that holds back the seas and lakes from flooding their towns, and they have done so for centuries. This is because[1] the dikes had a backup, or "sleeper," dike. So even if the first dike failed (and they very seldom failed), there was a second dike waiting to take up the effort.
Most well-run software systems have a primary alerting system, built on tools like Prometheus or Nagios, that sends alerts through programmatic integrations with Slack, email, or a phone alerting service like PagerDuty. The question "is the alerting system working?" keeps many of the best operators up at night. Because how would they know? This can be dealt with by running two alerting systems in parallel, each alerting if the other disappears, but that just kicks the can a little farther down the road.
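To make the can-kicking concrete, here is a minimal sketch of two alerting systems cross-checking each other's heartbeats. The URLs are placeholders, not how Prometheus or Nagios actually expose health; the point is that whatever runs this cross-check becomes the next unwatched thing.

```python
#!/usr/bin/env python3
"""Sketch of two alerting systems watching each other's heartbeats.

The URLs below are placeholders for illustration only.
"""
import urllib.request

HEARTBEATS = {
    "alerting-system-a": "https://alerts-a.internal.example/healthz",  # placeholder
    "alerting-system-b": "https://alerts-b.internal.example/healthz",  # placeholder
}


def is_up(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


for name, url in HEARTBEATS.items():
    if not is_up(url):
        # ...and who alerts us if this loop itself never runs?
        print(f"ALERT: {name} is not responding")
```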
...that is maximally independent of their other systems.
Worse, what operators are really defending against with this question is the worry that something outside their experience and knowledge is breaking things. They are trying to defend against themselves and the limits of their own knowledge, to allay the fear that something small might be fundamentally wrong with how they deploy, monitor, or alert. To allay that fear, what is needed is a second monitoring system run independently of everything else they operate.
Our Service
Triple Pat provides this through radical simplicity and by running the service entirely outside the operator's systems. The operator's only responsibility is to write a small cron job that fetches a Triple Pat check-in URL whenever everything is fine. The Triple Pat check-in service watches the timestamp of that last check-in and alerts the operator's phone if it ever becomes too old. In hardware this is called a "watchdog timer" and in software it is sometimes called a "dead man's switch", but we are not worried about theft or death, so we call ours a "check-in service", mirroring what teenagers had to do before cell phones when they stayed out late. You can stay out late, but make sure to call and check in every two hours.
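As an illustration, such a check-in job might look like the sketch below. The URL, the health check, and the schedule are assumptions for the example, not the actual Triple Pat API.

```python
#!/usr/bin/env python3
"""Minimal sketch of a check-in cron job.

The URL and the health check are placeholders; substitute the check-in
URL issued for your service and whatever "fine" means for your system.
"""
import sys
import urllib.request

CHECK_IN_URL = "https://checkin.example.invalid/your-check-in-id"  # placeholder


def everything_is_fine() -> bool:
    # Whatever "fine" means for your system: a local health endpoint,
    # queue depth, disk space, a recent successful backup, and so on.
    return True


def main() -> int:
    if not everything_is_fine():
        # Check in only when things are fine; a missed check-in is the signal.
        return 1
    # A plain GET is enough. If this fails, the check-in simply never arrives
    # and the stale timestamp eventually raises the alert on its own.
    with urllib.request.urlopen(CHECK_IN_URL, timeout=10) as resp:
        resp.read()
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A crontab entry such as `*/5 * * * * /usr/local/bin/triplepat_checkin.py` would then run it every few minutes; if the machine, the cron daemon, or the health check stops working, the check-ins stop, the timestamp goes stale, and the operator's phone alerts.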
The check-in service runs on Triple Pat's computers, and the app runs on the operator's cell phone. Neither should be a critical component of the operator's existing alerting, which means operators can be confident that failures of their own alerting system and failures of the Triple Pat system happen independently.
For more on how we keep our system reliable (spoiler: radical simplicity and extreme redundancy), we have an in-depth description of our design.
[1] Not solely because of the sleeper dikes, of course. History is a rich tapestry and this is a simplifying metaphor. But the sleeper dike system helped.