<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Triple Pat Blog</title><link>https://triplepat.com/</link><description>Latest blog posts from Triple Pat</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>peter@triplepat.com (Peter Boothe)</managingEditor><webMaster>peter@triplepat.com (Peter Boothe)</webMaster><atom:link href="https://triplepat.com/feed.xml" rel="self" type="application/rss+xml"/><item><title>Memoryless Scheduling Should Be Your Default</title><link>https://triplepat.com/blog/2026/04/04/memoryless</link><pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2026/04/04/memoryless</guid><description>&lt;p&gt;When someone says &amp;ldquo;run this hourly,&amp;rdquo; what do they actually mean?&lt;/p&gt;
&lt;p&gt;Most developers reach for &lt;code&gt;0 * * * *&lt;/code&gt; in cron, or &lt;code&gt;time.Tick(time.Hour)&lt;/code&gt; in Go, or &lt;code&gt;setInterval(fn, 3600000)&lt;/code&gt; in JavaScript. These all fire at fixed intervals. But fixed intervals are rarely what you actually want. They&amp;rsquo;re just what the tools make easy.&lt;/p&gt;
&lt;p&gt;The better default is &lt;em&gt;memoryless&lt;/em&gt; scheduling: intervals drawn from an exponential distribution. Your task still runs about once per hour on average, but the exact timing is random. This post argues that memoryless should be your default, and fixed intervals should require justification.&lt;/p&gt;
&lt;h2 id="whats-wrong-with-fixed-intervals"&gt;What&amp;rsquo;s wrong with fixed intervals?&lt;/h2&gt;
&lt;p&gt;Three things:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Accidental synchronization.&lt;/strong&gt; Two systems that both run &amp;ldquo;every hour&amp;rdquo; will eventually sync up and hit shared resources simultaneously. This causes thundering herds, lock contention, and correlated failures. The more systems you have, the worse this gets. People try to work around this with hacks like &amp;ldquo;run this every minute at the 15 second mark&amp;rdquo; and &amp;ldquo;run that at the 24 second mark.&amp;rdquo; But all it takes is one &amp;ldquo;run this every 17 seconds&amp;rdquo; and your perfect arrangement is lost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Correlation with periodic behavior.&lt;/strong&gt; Many systems have periodic patterns—daily traffic spikes, hourly batch jobs, garbage collection cycles. Fixed-interval sampling can systematically miss or over-represent states that correlate with your interval. Your &amp;ldquo;hourly&amp;rdquo; health check might always run during a quiet period and miss every traffic spike.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Biased sampling.&lt;/strong&gt; There&amp;rsquo;s a theorem in queueing theory called &lt;a href="https://en.wikipedia.org/wiki/Arrival_theorem#Theorem_for_arrivals_governed_by_a_Poisson_process"&gt;PASTA&lt;/a&gt;: Poisson Arrivals See Time Averages. It says that if you sample a system at random (Poisson-distributed) times, your samples are an unbiased representation of the system&amp;rsquo;s time-averaged state. Fixed-interval sampling doesn&amp;rsquo;t have this property.&lt;/p&gt;
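&lt;p&gt;To make the first point concrete, here is a minimal sketch that counts how often an every-17-seconds job lands on the same second as two carefully staggered once-a-minute jobs. The function name &lt;code&gt;countCollisions&lt;/code&gt; and the one-hour horizon are illustrative assumptions, not part of any real scheduler.&lt;/p&gt;

```go
package main

import "fmt"

// countCollisions returns how many times, within the first horizon seconds,
// a job that fires every period seconds lands on the same second as a
// once-a-minute job pinned to second offset of each minute.
func countCollisions(period, offset, horizon int) int {
	collisions := 0
	for t := 0; horizon > t; t += period {
		if t%60 == offset {
			collisions++
		}
	}
	return collisions
}

func main() {
	// Hypothetical schedule from the text: minute jobs at :15 and :24,
	// plus one job that runs every 17 seconds.
	fmt.Println("collisions with the :15 job in one hour:", countCollisions(17, 15, 3600))
	fmt.Println("collisions with the :24 job in one hour:", countCollisions(17, 24, 3600))
}
```

&lt;p&gt;Each staggered job is hit four times per hour. Because 17 and 60 share no common factor, the 17-second job eventually lands on every second of the minute, so no choice of offsets avoids it.&lt;/p&gt;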
&lt;h2 id="what-is-memoryless-scheduling"&gt;What is memoryless scheduling?&lt;/h2&gt;
&lt;p&gt;What we want is a way to guarantee that events happen at an average of once per hour, but where you cannot predict exactly when the next event will happen even if you know when all the previous events happened. This is called the &amp;ldquo;memoryless&amp;rdquo; property, and there&amp;rsquo;s only one statistical process that has it: the Poisson process.&lt;/p&gt;
&lt;p&gt;A Poisson process generates events where the inter-arrival times follow an exponential distribution. In Python, it&amp;rsquo;s one line:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; random
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; time
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Sleep for a random duration averaging 1 hour&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;time&lt;span style="color:#f92672"&gt;.&lt;/span&gt;sleep(random&lt;span style="color:#f92672"&gt;.&lt;/span&gt;expovariate(&lt;span style="color:#ae81ff"&gt;1.0&lt;/span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;3600.0&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;expovariate&lt;/code&gt; function returns exponentially-distributed random numbers. The parameter is the rate (events per second), so &lt;code&gt;1.0 / 3600.0&lt;/code&gt; gives you an average of one event per hour.&lt;/p&gt;
&lt;p&gt;In Go, the equivalent is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; (
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;math/rand&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;time&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// Sleep for a random duration averaging 1 hour&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;duration&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;rand&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;ExpFloat64&lt;/span&gt;() &lt;span style="color:#f92672"&gt;*&lt;/span&gt; float64(&lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Hour&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Sleep&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;duration&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That&amp;rsquo;s it. That&amp;rsquo;s the core of memoryless scheduling.&lt;/p&gt;
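&lt;p&gt;The memoryless property itself can be checked empirically. The sketch below is an illustration under assumed parameters (the helper name &lt;code&gt;survival&lt;/code&gt;, the seed, and the sample count are all made up for this example). It compares the chance that a fresh wait exceeds &lt;code&gt;t&lt;/code&gt; with the chance that a wait which has already lasted &lt;code&gt;s&lt;/code&gt; goes on for more than &lt;code&gt;t&lt;/code&gt; longer. If knowing how long you have already waited tells you nothing, the two fractions should be close.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/rand"
)

// survival draws n exponential waits, measured in units of the expected
// interval, and returns two fractions: the share of all waits longer than t,
// and, among waits that have already lasted longer than s, the share that
// last more than t longer still. Memorylessness says the two should agree.
func survival(seed int64, n int, s, t float64) (fresh, conditional float64) {
	rng := rand.New(rand.NewSource(seed))
	var beyondT, beyondS, beyondBoth int
	for i := 0; i != n; i++ {
		x := rng.ExpFloat64()
		if x > t {
			beyondT++
		}
		if x > s {
			beyondS++
			if x > s+t {
				beyondBoth++
			}
		}
	}
	return float64(beyondT) / float64(n), float64(beyondBoth) / float64(beyondS)
}

func main() {
	// With an expected interval of one hour, s = 0.5 means "already waited
	// 30 minutes" and t = 1.0 means "one more hour".
	fresh, conditional := survival(1, 200000, 0.5, 1.0)
	fmt.Printf("P(wait > t)                 = %.3f\n", fresh)
	fmt.Printf("P(wait > s+t given wait > s) = %.3f\n", conditional)
}
```

&lt;p&gt;Both values should land near e&lt;sup&gt;-1&lt;/sup&gt;, roughly 0.37, up to sampling noise.&lt;/p&gt;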
&lt;h2 id="a-production-ready-implementation"&gt;A production-ready implementation&lt;/h2&gt;
&lt;p&gt;The one-liner works, but production code needs bounds. True exponential distributions are unbounded—you might get an interval of 10 hours or 10 milliseconds. Operationally, you often want guarantees.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the implementation we use at Triple Pat. It uses our &lt;a href="https://triplepat.com/blog/2025/02/28/clocks"&gt;Clock interface&lt;/a&gt; instead of calling &lt;code&gt;time&lt;/code&gt; directly, which makes it testable:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;package&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;memoryless&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; (
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;context&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;math/rand&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;time&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;github.com/triple-pat/go/clocks&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;type&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Ticker&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Clock&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;clocks&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Clock&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Expected&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Min&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Max&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;t&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Ticker&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;randomWaitTime&lt;/span&gt;() &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;var&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;wt&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt; = &lt;span style="color:#f92672"&gt;-&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// Resample until within bounds&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;for&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;wt&lt;/span&gt; &amp;lt; &lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Min&lt;/span&gt; &lt;span style="color:#f92672"&gt;||&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Max&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;wt&lt;/span&gt; &amp;gt; &lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Max&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;wt&lt;/span&gt; = &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;rand&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;ExpFloat64&lt;/span&gt;() &lt;span style="color:#f92672"&gt;*&lt;/span&gt; float64(&lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Expected&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;wt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;t&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Ticker&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;Tick&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;context&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Context&lt;/span&gt;) &lt;span style="color:#f92672"&gt;&amp;lt;-&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;chan&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;c&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; make(&lt;span style="color:#66d9ef"&gt;chan&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt;) &lt;span style="color:#75715e"&gt;// Unbuffered is important&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;go&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;func&lt;/span&gt;() {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;defer&lt;/span&gt; close(&lt;span style="color:#a6e22e"&gt;c&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;for&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Err&lt;/span&gt;() &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;duration&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;randomWaitTime&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;select&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;case&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;now&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;-&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Clock&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;After&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;duration&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;select&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;case&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;c&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;-&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;now&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;default&lt;/span&gt;: &lt;span style="color:#75715e"&gt;// Don&amp;#39;t block if receiver is busy&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;case&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;-&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Done&lt;/span&gt;():
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;c&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A few design choices worth noting:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Resampling vs. clamping.&lt;/strong&gt; When we get a value outside bounds, we resample rather than clamp. Clamping (forcing out-of-bounds values to Min or Max) introduces spikes at the boundaries. Resampling preserves the distribution shape within the allowed range.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unbuffered channel.&lt;/strong&gt; If the receiver is busy when a tick arrives, we drop it rather than queueing. This prevents ticks from &amp;ldquo;bunching up&amp;rdquo; if the receiver falls behind.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clock interface.&lt;/strong&gt; Using a &lt;a href="https://triplepat.com/blog/2025/02/28/clocks"&gt;Clock&lt;/a&gt; instead of calling &lt;code&gt;time.After&lt;/code&gt; and &lt;code&gt;time.Now&lt;/code&gt; directly means we can test the ticker&amp;rsquo;s mechanics deterministically without waiting for real time to pass. The patterns compose: clocks make memoryless testable, and memoryless makes distributed systems reliable.&lt;/p&gt;
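&lt;p&gt;&lt;strong&gt;Min and Max are optional.&lt;/strong&gt; Leave Min at 0 to disable the lower bound, and set Max to 0 to disable the upper bound. This lets you use the same code for bounded and unbounded cases. If you set both, keep Min well below Max, or the resampling loop will never find an acceptable value.&lt;/p&gt;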
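&lt;p&gt;The resampling-versus-clamping choice is easy to see in a quick simulation. This is a rough sketch rather than the Triple Pat code: the bounds &lt;code&gt;lo&lt;/code&gt; and &lt;code&gt;hi&lt;/code&gt; (expressed as multiples of the expected interval) and the helper names are assumptions chosen for illustration.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/rand"
)

// Bounds in units of the expected interval, so 1.0 means "one Expected".
// These particular values are illustrative assumptions.
const (
	lo = 0.1
	hi = 2.5
)

// clamp forces an out-of-range draw onto the nearest bound.
func clamp(x float64) float64 {
	if lo > x {
		return lo
	}
	if x > hi {
		return hi
	}
	return x
}

// resample redraws until the value lands inside the bounds, in the same
// spirit as the loop in randomWaitTime.
func resample(rng *rand.Rand) float64 {
	x := rng.ExpFloat64()
	for lo > x || x > hi {
		x = rng.ExpFloat64()
	}
	return x
}

// summarize returns the mean of xs and the fraction of values equal to hi.
func summarize(xs []float64) (mean, atMax float64) {
	sum, count := 0.0, 0
	for _, x := range xs {
		sum += x
		if x == hi {
			count++
		}
	}
	n := float64(len(xs))
	return sum / n, float64(count) / n
}

// compare draws n clamped and n resampled waits from one seeded generator.
func compare(seed int64, n int) (clampedMean, clampedAtMax, resampledMean, resampledAtMax float64) {
	rng := rand.New(rand.NewSource(seed))
	clamped := make([]float64, n)
	resampled := make([]float64, n)
	for i := range clamped {
		clamped[i] = clamp(rng.ExpFloat64())
		resampled[i] = resample(rng)
	}
	clampedMean, clampedAtMax = summarize(clamped)
	resampledMean, resampledAtMax = summarize(resampled)
	return
}

func main() {
	cm, ca, rm, ra := compare(1, 100000)
	fmt.Printf("clamped:   mean %.3f, share exactly at max %.3f\n", cm, ca)
	fmt.Printf("resampled: mean %.3f, share exactly at max %.3f\n", rm, ra)
}
```

&lt;p&gt;With these bounds, about 8% of clamped waits (e&lt;sup&gt;-2.5&lt;/sup&gt;) pile up exactly on the maximum, while resampled waits essentially never do. One caveat the sketch also shows: either way of bounding pulls the average below the nominal expected interval here, so tight bounds change the effective rate.&lt;/p&gt;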
&lt;h2 id="testing-randomness"&gt;Testing randomness&lt;/h2&gt;
&lt;p&gt;How do you test that your random numbers are actually random? You use statistics.&lt;/p&gt;
&lt;p&gt;The mechanical behavior—bounds enforcement, channel semantics, context cancellation—is tested deterministically using a &lt;code&gt;Blocking&lt;/code&gt; clock. But to verify that the &lt;em&gt;distribution&lt;/em&gt; is actually exponential, we use a &lt;a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test"&gt;Kolmogorov-Smirnov test&lt;/a&gt;. It generates 10,000 samples and compares the empirical distribution against the theoretical one:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;TestWaitTimeDistribution&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;t&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;testing&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;T&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;testing&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Short&lt;/span&gt;() {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Skip&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;skipping statistical significance test&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;cfg&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Ticker&lt;/span&gt;{&lt;span style="color:#a6e22e"&gt;Expected&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;10&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Millisecond&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;const&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;sampleSize&lt;/span&gt; = &lt;span style="color:#ae81ff"&gt;10_000&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;intervals&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; make([]&lt;span style="color:#66d9ef"&gt;float64&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;sampleSize&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;for&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;i&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;range&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;intervals&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;intervals&lt;/span&gt;[&lt;span style="color:#a6e22e"&gt;i&lt;/span&gt;] = &lt;span style="color:#a6e22e"&gt;cfg&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;randomWaitTime&lt;/span&gt;().&lt;span style="color:#a6e22e"&gt;Seconds&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// ... compute mean, lambda, sort, then for each sample:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// compare empirical CDF vs theoretical CDF (1 - e^(-λx))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// track the maximum difference ...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// Critical value for 99% confidence&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;criticalValue&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1.05&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;math&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Sqrt&lt;/span&gt;(
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;-&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;math&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Log&lt;/span&gt;(&lt;span style="color:#ae81ff"&gt;0.005&lt;/span&gt;) &lt;span style="color:#f92672"&gt;/&lt;/span&gt; (&lt;span style="color:#ae81ff"&gt;2&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt; float64(&lt;span style="color:#a6e22e"&gt;sampleSize&lt;/span&gt;)))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;require&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Less&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;maxD&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;criticalValue&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;K-S test failed: distribution doesn&amp;#39;t match exponential&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This test can fail spuriously about 1% of the time; that is what testing at a 99% confidence level means. So we skip it in &lt;code&gt;-short&lt;/code&gt; mode and accept the inherent flakiness. This is a reasonable tradeoff: strong statistical verification without making CI unreliable.&lt;/p&gt;
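&lt;p&gt;The elided middle of that test is the K-S statistic itself. As a hedged sketch of what such a computation can look like (this is not the code from our test suite, and the helper names are invented for this example), the following estimates the rate from the sample mean, sorts the samples, and tracks the largest gap between the empirical and theoretical CDFs. It uses the plain large-sample critical value, without the 1.05 safety factor shown above.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// ksStatistic returns the largest gap between the empirical CDF of samples
// and the exponential CDF 1 - e^(-lambda*x), with lambda estimated as
// 1/mean of the samples. The slice is sorted in place.
func ksStatistic(samples []float64) float64 {
	sum := 0.0
	for _, x := range samples {
		sum += x
	}
	n := float64(len(samples))
	lambda := n / sum
	sort.Float64s(samples)
	maxD := 0.0
	for i, x := range samples {
		theoretical := 1 - math.Exp(-lambda*x)
		// The empirical CDF steps from i/n up to (i+1)/n at x, so the gap
		// is checked on both sides of the step.
		before := math.Abs(theoretical - float64(i)/n)
		after := math.Abs(float64(i+1)/n - theoretical)
		maxD = math.Max(maxD, math.Max(before, after))
	}
	return maxD
}

// criticalValue is the large-sample K-S threshold at significance alpha.
func criticalValue(alpha float64, n int) float64 {
	return math.Sqrt(-math.Log(alpha/2) / (2 * float64(n)))
}

// exponentialSamples draws n exponential values from a seeded generator.
func exponentialSamples(seed int64, n int) []float64 {
	rng := rand.New(rand.NewSource(seed))
	xs := make([]float64, n)
	for i := range xs {
		xs[i] = rng.ExpFloat64()
	}
	return xs
}

// uniformSamples draws n uniform values, as a deliberately wrong distribution.
func uniformSamples(seed int64, n int) []float64 {
	rng := rand.New(rand.NewSource(seed))
	xs := make([]float64, n)
	for i := range xs {
		xs[i] = rng.Float64()
	}
	return xs
}

func main() {
	const n = 10000
	limit := criticalValue(0.01, n)
	fmt.Printf("critical value: %.4f\n", limit)
	fmt.Printf("exponential D:  %.4f\n", ksStatistic(exponentialSamples(1, n)))
	fmt.Printf("uniform D:      %.4f\n", ksStatistic(uniformSamples(1, n)))
}
```

&lt;p&gt;Exponential samples should produce a statistic well under the critical value, while the uniform samples should exceed it by a wide margin. Estimating the rate from the same samples makes the test somewhat more lenient than the nominal 1% suggests.&lt;/p&gt;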
&lt;h2 id="when-do-you-actually-need-fixed-intervals"&gt;When do you actually need fixed intervals?&lt;/h2&gt;
&lt;p&gt;Fixed intervals aren&amp;rsquo;t always wrong. Here are legitimate reasons to use them:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;External coordination.&lt;/strong&gt; If you need to sync with an external system that expects requests at specific times, you need fixed intervals.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Human expectations.&lt;/strong&gt; Daily reports that arrive &amp;ldquo;around 9am, give or take a few hours&amp;rdquo; will confuse people. Some things genuinely need to happen at predictable times.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rate limiting.&lt;/strong&gt; If you&amp;rsquo;re allowed exactly N requests per hour by an external API, you might need fixed intervals to stay within limits.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Debugging.&lt;/strong&gt; Fixed intervals are easier to reason about when tracking down timing issues.&lt;/p&gt;
&lt;p&gt;But notice these are all about external constraints. For internal operations—garbage collection, cache invalidation, health checks, metric generation, background sync—memoryless is almost always better.&lt;/p&gt;
&lt;h2 id="real-world-applications"&gt;Real-world applications&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve used this pattern for years across different projects:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Internet speed tests.&lt;/strong&gt; When measuring broadband quality, you want samples that represent actual usage patterns. Fixed hourly tests might always miss peak congestion. Memoryless sampling gives unbiased measurements. (See &lt;a href="https://github.com/pboothe/mlab-test-runner"&gt;mlab-test-runner&lt;/a&gt; for an early implementation.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Synthetic alert generation.&lt;/strong&gt; At Triple Pat, we generate fake alerts on a memoryless schedule to test that our alerting pipeline works. Because Prometheus scrapes at fixed intervals, making our signal generation memoryless ensures the scraper sees our test alerts with unbiased probability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mirror synchronization.&lt;/strong&gt; Our distributed check-in service has multiple servers that need to sync with each other. Memoryless sync intervals prevent them from all trying to sync simultaneously.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Database cleanup.&lt;/strong&gt; Expired records need to be deleted, but it doesn&amp;rsquo;t matter exactly when. Memoryless scheduling spreads the load and avoids coordinated spikes.&lt;/p&gt;
&lt;h2 id="the-industry-is-getting-this-wrong"&gt;The industry is getting this wrong&lt;/h2&gt;
&lt;p&gt;Every major observability tool—Prometheus, OpenTelemetry, Grafana Alloy—scrapes metrics at fixed intervals. This means the entire industry is collecting potentially biased samples by default.&lt;/p&gt;
&lt;p&gt;If your system has any periodic behavior that correlates with your scrape interval, your metrics may be systematically misleading. A service that&amp;rsquo;s slow for 10 seconds every minute might look perfectly healthy or constantly broken, depending on when your collector happened to start.&lt;/p&gt;
&lt;p&gt;The PASTA theorem tells us this is fixable: use Poisson-distributed scrape intervals. But none of the major collectors support this. It&amp;rsquo;s a gap in the ecosystem.&lt;/p&gt;
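&lt;p&gt;Here is a small simulation of the example above: a service that is slow for the first 10 seconds of every minute, observed once a minute on average. It is a sketch under simplifying assumptions (a perfectly periodic service, instantaneous probes, made-up function names), not a model of any real collector.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// isSlow reports whether the hypothetical service is slow at time t seconds:
// it is slow during the first 10 seconds of every minute.
func isSlow(t float64) bool {
	return 10 > math.Mod(t, 60)
}

// fixedSlowFraction probes every 60 seconds, starting at the given offset,
// and returns the fraction of probes that saw a slow service.
func fixedSlowFraction(offset float64, n int) float64 {
	slow := 0
	for i := 0; i != n; i++ {
		if isSlow(offset + 60*float64(i)) {
			slow++
		}
	}
	return float64(slow) / float64(n)
}

// poissonSlowFraction probes with exponential gaps averaging 60 seconds and
// returns the fraction of probes that saw a slow service.
func poissonSlowFraction(seed int64, n int) float64 {
	rng := rand.New(rand.NewSource(seed))
	t, slow := 0.0, 0
	for i := 0; i != n; i++ {
		t += rng.ExpFloat64() * 60
		if isSlow(t) {
			slow++
		}
	}
	return float64(slow) / float64(n)
}

func main() {
	fmt.Printf("true slow fraction:       %.3f\n", 10.0/60.0)
	fmt.Printf("fixed, started at 5s:     %.3f\n", fixedSlowFraction(5, 1000))
	fmt.Printf("fixed, started at 30s:    %.3f\n", fixedSlowFraction(30, 1000))
	fmt.Printf("memoryless, mean 60s gap: %.3f\n", poissonSlowFraction(1, 100000))
}
```

&lt;p&gt;The fixed-interval probe reports either always slow or never slow, depending only on its start offset. The memoryless probe should come out close to the true one-sixth.&lt;/p&gt;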
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;Memoryless scheduling is the right default for periodic tasks. Fixed intervals should require justification.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use memoryless when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The task is internal (no external coordination needed)&lt;/li&gt;
&lt;li&gt;The timing doesn&amp;rsquo;t need to be predictable to humans&lt;/li&gt;
&lt;li&gt;You want unbiased sampling of system state&lt;/li&gt;
&lt;li&gt;You have multiple instances that might accidentally synchronize&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use fixed intervals when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;External systems expect requests at specific times&lt;/li&gt;
&lt;li&gt;Humans need predictable schedules&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;re working around rate limits&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;re debugging timing issues&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The implementation is simple—a few lines of code. The hard part is remembering to use it instead of reaching for &lt;code&gt;cron&lt;/code&gt; or &lt;code&gt;time.Tick&lt;/code&gt; out of habit.&lt;/p&gt;
&lt;h2 id="use-it"&gt;Use it!&lt;/h2&gt;
&lt;p&gt;The full implementation is available as a &lt;a href="https://gist.github.com/pboothe/1acb4f33d5158754011d881151fdeada"&gt;gist&lt;/a&gt;. A similar implementation is also available in the &lt;a href="https://github.com/m-lab/go/blob/main/memoryless/memoryless.go"&gt;M-Lab Go library&lt;/a&gt;. Both are designed to be drop-in replacements for &lt;code&gt;time.Tick()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;All code in this blog post is &lt;a href="https://creativecommons.org/public-domain/cc0/"&gt;CC0&lt;/a&gt;, which means you can freely use it however you want.&lt;/p&gt;
&lt;p&gt;Happy (randomly-timed) coding!&lt;/p&gt;</description></item><item><title>Design of the checkin service</title><link>https://triplepat.com/blog/2025/06/20/design</link><pubDate>Fri, 20 Jun 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/06/20/design</guid><description>&lt;h2 id="making-a-service-that-is-radically-simple-and-reliable"&gt;&amp;ldquo;Making a service that is radically simple and reliable&amp;rdquo;&lt;/h2&gt;
&lt;p&gt;The critical insight is that a service collecting &lt;code&gt;(timestamp, uuid)&lt;/code&gt; pairs
that only cares about the most recent timestamp for a given UUID is actually
maintaining a
&lt;a href="https://en.m.wikipedia.org/wiki/Conflict-free_replicated_data_type"&gt;CRDT&lt;/a&gt;.
All that work that databases need to do in order to prevent conflicting updates
just goes out the window, and we can create a database that can be run in
parallel with multiple masters without having to worry about the &lt;a href="https://en.wikipedia.org/wiki/CAP_theorem"&gt;CAP
theorem&lt;/a&gt;.&lt;/p&gt;
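&lt;p&gt;As an illustration of why this works, here is a minimal sketch of such a structure. It is not our production code: the &lt;code&gt;Checkins&lt;/code&gt; type, its method names, and the use of plain strings and Unix timestamps are assumptions made for this example. The key point is that merging by taking the newest timestamp per UUID gives the same answer regardless of the order or number of merges.&lt;/p&gt;

```go
package main

import "fmt"

// Checkins maps a UUID (kept as a plain string here) to the Unix timestamp
// of the most recent check-in seen for it.
type Checkins map[string]int64

// Record notes a check-in, keeping only the newest timestamp per UUID.
func (c Checkins) Record(uuid string, ts int64) {
	if old, ok := c[uuid]; !ok || ts > old {
		c[uuid] = ts
	}
}

// Merge folds another replica's state into this one. Taking the per-UUID
// maximum is commutative, associative, and idempotent, so replicas can
// exchange state in any order, any number of times, and still converge.
func (c Checkins) Merge(other Checkins) {
	for uuid, ts := range other {
		c.Record(uuid, ts)
	}
}

func main() {
	// Two replicas that each saw different check-ins.
	a := Checkins{}
	b := Checkins{}
	a.Record("uuid-1", 100)
	b.Record("uuid-1", 250)
	b.Record("uuid-2", 50)

	// After merging in both directions the replicas agree.
	a.Merge(b)
	b.Merge(a)
	fmt.Println(a["uuid-1"], b["uuid-1"], a["uuid-2"], b["uuid-2"])
}
```

&lt;p&gt;No locking or conflict resolution protocol is needed between replicas, because there is no such thing as a conflicting update: the later timestamp always wins.&lt;/p&gt;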
&lt;p&gt;Everything else is a consequence of that one insight, combined with a desire
for simplicity across the board. The rest of this document is about how we
built the backend system, and the principles we stuck to in the process.&lt;/p&gt;
&lt;h2 id="architecture"&gt;Architecture&lt;/h2&gt;
&lt;p&gt;We use a microservice architecture. Each service is a separate binary run in
its own separate Docker container. There are three main services: the user
service, the email service, and the checkin service. We implement and run
these services to support the phone apps (which run on iOS and Android on
users&amp;rsquo; phones) and user checkins (which run inside users&amp;rsquo; infrastructure).&lt;/p&gt;
&lt;p&gt;The user service is necessarily more centralized because it needs to perform
actions exactly once. This means that the user service is inherently less
reliable, so we should design the phone apps to not need to contact it more
than once a week &amp;mdash; they should try every day (or maybe even more often?), but
things should only start going sideways if they have been out of contact for a
week.&lt;/p&gt;
&lt;p&gt;The checkin service is distributed. Our fleet of 5+ checkin servers operates in
master-master mode. Each can be authoritative. The checkin-service must be
made reliable. We achieve that reliability through redundancy and running
things in multiple clouds. Users should always be able to contact a checkin
server to either perform a checkin or to retrieve the time of the last checkin.&lt;/p&gt;
&lt;p&gt;The email service is distributed the same way email is distributed: with MX DNS
records. Each box in our fleet is running an email server, and the SMTP
protocol supports failover from one MX recipient to the next. Receiving
check-in emails causes the email service to invoke the check-in service for the
appropriate UUID.&lt;/p&gt;
&lt;p&gt;Users use &lt;code&gt;{protocol}://{server}/api/v1/checkin/{uuid}&lt;/code&gt; to check-in and
&lt;code&gt;{protocol}://{server}/api/v1/getlastcheckin/{uuid}&lt;/code&gt; to get the last check-in for a
given UUID. &lt;code&gt;{protocol}&lt;/code&gt; can be http or https. &lt;code&gt;{server}&lt;/code&gt; can be any of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;triplepat.com triplepat.net&lt;/li&gt;
&lt;li&gt;a.triplepat.com a.triplepat.net&lt;/li&gt;
&lt;li&gt;b.triplepat.com b.triplepat.net&lt;/li&gt;
&lt;li&gt;c.triplepat.com c.triplepat.net&lt;/li&gt;
&lt;li&gt;d.triplepat.com d.triplepat.net&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The email service is also distributed, because the way SMTP and DNS MX records work
means that we can specify a number of email servers and the senders&amp;rsquo; MTAs will
automatically fail-over. We run one email-service container on every mirror.
This allows users to check-in by emailing &lt;code&gt;{uuid}@checkin.triplepat.com&lt;/code&gt; or
&lt;code&gt;{uuid}@checkin.triplepat.net&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;All our APIs, UUIDs, URLs, and email addresses are case-insensitive.&lt;/p&gt;
&lt;p&gt;As much as possible, we store all &amp;ldquo;brains&amp;rdquo; and user preferences in the phone
apps. That way, the user doesn&amp;rsquo;t have to contact us very often except for
checkins. The checkin server has no knowledge of users or userids. It just
stores &lt;code&gt;(UUID, timestamp)&lt;/code&gt; pairs. We never trust user clocks, so when users
perform a checkin, we store a timestamp generated by the server rather than one
from the user&amp;rsquo;s phone. Our servers&amp;rsquo; clocks are synced by our cloud providers.
We trust them more than we trust user clocks.&lt;/p&gt;
&lt;h2 id="defaults"&gt;Defaults&lt;/h2&gt;
&lt;p&gt;We use &lt;a href="https://boringtechnology.club/"&gt;default, boring technology&lt;/a&gt; whenever
possible. Every time we are clever, we should know the reason why.&lt;/p&gt;
&lt;p&gt;This means we use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sqlite.org/"&gt;SQLite&lt;/a&gt; for the &lt;code&gt;checkin-service&lt;/code&gt; database.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt; for the &lt;code&gt;user-service&lt;/code&gt; database.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; for internal networking.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://letsencrypt.org/"&gt;Let&amp;rsquo;s Encrypt&lt;/a&gt; for SSL/TLS certificates.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cert-manager.io/"&gt;cert-manager&lt;/a&gt; to manage them on each host.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nginx.org/"&gt;nginx&lt;/a&gt; for our public web server and reverse proxy.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docker.com/"&gt;Docker&lt;/a&gt; for containerization.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hub.docker.com/"&gt;Docker Hub&lt;/a&gt; for our container images.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; for monitoring and metrics.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt; for monitoring dashboards.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/"&gt;GitHub&lt;/a&gt; for our source code management.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/features/actions"&gt;GitHub Actions&lt;/a&gt; for CI/CD.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://golang.org/"&gt;Go&lt;/a&gt; for our server programming language.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotlinlang.org/"&gt;Kotlin&lt;/a&gt; and &lt;a href="https://swift.org/"&gt;Swift&lt;/a&gt; for our mobile programming languages.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/"&gt;AWS&lt;/a&gt; and &lt;a href="https://cloud.google.com/"&gt;GCP&lt;/a&gt; for our cloud providers, and we only use tiny vanilla machines to avoid lock-in. We also use &lt;a href="https://www.tilaa.com/"&gt;TILAA&lt;/a&gt; to avoid lock-in to &amp;ldquo;just the big providers&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://godaddy.com/"&gt;GoDaddy&lt;/a&gt; for our DNS provider for triplepat.com.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloudflare.com/"&gt;Cloudflare&lt;/a&gt; for our DNS for triplepat.net.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gohugo.io"&gt;Hugo&lt;/a&gt; for our static website generator.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We also use every available linter and formatter to make sure our code,
configs, and output are internally clean and consistent and align with the
expectations of the outside world. Every lint rule we use is one less rule we
need to remember, and human brain space is precious.&lt;/p&gt;
&lt;h2 id="reliability"&gt;Reliability&lt;/h2&gt;
&lt;p&gt;We acquire reliability via the &amp;ldquo;cockroach&amp;rdquo; strategy of many cheap replications
around the world. We run multiple instances and we run instances in multiple
clouds. To do this, we need to make sure the computers we rent in any given
cloud are not too expensive. We also need to monitor our systems because &amp;ldquo;if
you aren&amp;rsquo;t monitoring, you don&amp;rsquo;t have uptime&amp;rdquo;.&lt;/p&gt;
&lt;h2 id="efficiency"&gt;Efficiency&lt;/h2&gt;
&lt;p&gt;We keep our memory requirements low, because the cost of running a single
server directly translates to how many servers we can realistically run, and
more servers means more reliability.
&lt;a href="https://dictionary.archivists.org/entry/lots-of-copies-keep-stuff.html"&gt;LOCKSS&lt;/a&gt;
is the order of the day. Right now our services collectively (nginx + Alloy +
checkin-service) run fine on 1GB of RAM, which means we pay under $20/server/month.
Going to 2GB doubles the per-server cost and 4GB quadruples it, etc. For
services that don&amp;rsquo;t need much CPU, RAM is the cost driver, with network usage
lurking behind it. Our cockroach strategy means we need to respect RAM usage.&lt;/p&gt;
&lt;h2 id="storage"&gt;Storage&lt;/h2&gt;
&lt;p&gt;We use PostgreSQL for the centralized &lt;code&gt;user-service&lt;/code&gt; data. This allows for
better concurrency and scalability compared to file-based databases, which is
important for the user service&amp;rsquo;s role.&lt;/p&gt;
&lt;p&gt;For the distributed &lt;code&gt;checkin-service&lt;/code&gt;, we use SQLite. It is simple, small,
allows us to store data in a single file per instance, and we don&amp;rsquo;t have to
write our own serialization code. The synchronization between &lt;code&gt;checkin-service&lt;/code&gt;
instances is handled at the application level, not the database level.&lt;/p&gt;
&lt;p&gt;We avoid using database-specific features where practical to retain
flexibility. We use PostgreSQL for the user service because we need to be able
to write ad-hoc queries over our set of users, so we need the storage to be
safely readable and writable by other processes.&lt;/p&gt;
&lt;h2 id="logging"&gt;Logging&lt;/h2&gt;
&lt;p&gt;We log using &lt;a href="https://pkg.go.dev/log/slog"&gt;slog&lt;/a&gt;. Because of its flexibility,
we can swap out backends if we need to use someone else&amp;rsquo;s log processing
software (e.g. Logrus and the like), but outside of &lt;code&gt;main()&lt;/code&gt;, things are logged
to &lt;code&gt;slog&lt;/code&gt; or they aren&amp;rsquo;t logged. You should use the slog global logger, or
derive your custom logger from it, as every main function should set it up with
good settings.&lt;/p&gt;
&lt;h2 id="monitoring-and-metrics"&gt;Monitoring and Metrics&lt;/h2&gt;
&lt;p&gt;We set things up to be monitored with &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt;.
Its pull-based scraping semantics make the most sense for us because it means
that &lt;a href="https://github.com/grafana/alloy"&gt;Alloy&lt;/a&gt; can choose how much/often to
scrape and we can tune things to make sure we don&amp;rsquo;t spend more than we intend
by sending too much to &lt;a href="https://grafana.com"&gt;Grafana&lt;/a&gt;. Also, pull-based
semantics force us to generate metrics that are sampleable, like counts,
instead of ephemeral metrics like gauge values. It is appropriate to force
the developer to choose the right metric type, because they are the one who
knows best what is being measured.&lt;/p&gt;
&lt;p&gt;Strongly prefer Counters and Histograms for metrics instead of Gauges.
Prometheus works by sampling metrics, and sampling a counter or histogram
coarsens data but sampling a gauge loses data.&lt;/p&gt;
&lt;p&gt;Every metric name should lint correctly with &lt;a href="https://pkg.go.dev/github.com/prometheus/client_golang/prometheus/testutil/promlint"&gt;promlint&lt;/a&gt;.
Every metric name should also end with the unit of measure or &lt;code&gt;_total&lt;/code&gt;. This
helps us build dashboards and combine values in a sensible manner. Use
base SI units when possible (&amp;ldquo;meters&amp;rdquo; instead of &amp;ldquo;kilometers&amp;rdquo;), and strongly prefer counts to gauges.&lt;/p&gt;
&lt;h2 id="docker"&gt;Docker&lt;/h2&gt;
&lt;p&gt;Every binary we run in production is run in a Docker container. This means
each of them needs a Dockerfile in the same directory as its &lt;code&gt;main()&lt;/code&gt;
function. Please use
&lt;code&gt;checkin-service/Dockerfile&lt;/code&gt; as an example to get
you started. If you do it right, then you can keep compilation times low while
still getting the benefits of in-container unit testing.&lt;/p&gt;
&lt;p&gt;Every Dockerfile should do one thing. Containers are not VMs and should not be
used like VMs. We enforce this by using distroless images where we can and
only falling back to Alpine-linux images when we need ad-hoc debug tooling.&lt;/p&gt;
&lt;p&gt;Run one process per container, and try to invoke that process without forcing
the shell to parse a string to run it (use the array form of &lt;code&gt;ENTRYPOINT&lt;/code&gt;).
The process should run as &lt;code&gt;nobody&lt;/code&gt; or we should have a good reason why it
can&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Use &lt;code&gt;ENTRYPOINT&lt;/code&gt; instead of &lt;code&gt;CMD&lt;/code&gt; in the Dockerfiles, because our Docker
images are meant to hold and run a single binary.&lt;/p&gt;
&lt;p&gt;Build and deploy in different containers using multi-stage builds. Among other
things, this keeps the deployed size of our images quite small (under 20MB).
Deploying a compiler alongside every service not only wastes disk space
and bandwidth, it also needlessly adds complicating pieces to our images.
We keep things simple by removing needless complications.&lt;/p&gt;
&lt;p&gt;Every Dockerfile should have a HEALTHCHECK directive in it. This allows
deployment to stall until a container is actually working instead of treating
container invocation as a &amp;ldquo;fire and forget&amp;rdquo; operation. There is a
&lt;code&gt;healthcheck&lt;/code&gt; directory containing helpful libraries and a
&lt;code&gt;healthcheck/check-health&lt;/code&gt; binary to perform http-based health-checks (like the
ones provided by the &lt;code&gt;healthcheck&lt;/code&gt; library). Also, the person building the
service knows best what the healthcheck should be, so it is appropriate to
make the Dockerfile author design it, and not the deployer.&lt;/p&gt;
&lt;p&gt;Each &lt;code&gt;Dockerfile&lt;/code&gt; should, as much as possible, start with exactly the same text
as the others. In particular, everything up through the running of all the
unit tests should be the same. This allows us to only run the unit tests once
when we build all our containers, because &lt;code&gt;docker build&lt;/code&gt; can use the cache-hit
from testing the first to skip the tests for the second.&lt;/p&gt;
&lt;h2 id="health-checks"&gt;Health checks&lt;/h2&gt;
&lt;p&gt;Every daemon run in a Docker container should support health checks. The simplest
implementation of health checks can be found in our internal healthcheck
library. This contains a server to report health to the container, a binary to
connect to that server and report the health back to docker via its exit code,
and a simple interface for things that want to be checked.&lt;/p&gt;
&lt;p&gt;All of our docker images should have health checks, one way or another.&lt;/p&gt;
&lt;p&gt;It is easy to add a healthcheck to your service by implementing the
&lt;code&gt;healthcheck.CheckableService&lt;/code&gt; interface which has one method. You should
have a health check for every service and pass all such services to
the health check server when you invoke it in &lt;code&gt;main()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We should open-source this small library. It is so useful and good.&lt;/p&gt;
&lt;h2 id="test-coverage"&gt;Test coverage&lt;/h2&gt;
&lt;p&gt;All code should be 100% covered by unit tests. Getting uncovered code back to
being covered should be considered a high-priority task. 100% coverage doesn&amp;rsquo;t
guarantee no bugs, but it does guarantee that every line of code can be
executed without crashing in at least one context. Also, the difference
between 100% and 99% is psychologically much greater than the difference
between 99% and 98% or 79% and 78% &amp;mdash; once you start allowing coverage to
slip, it&amp;rsquo;s hard to get it back up and easy to let it get worse. Beyond simply
executing every line, striving for complete coverage encourages writing code
that is inherently more modular and designed to be testable. &amp;ldquo;Keep it 100&amp;rdquo;
and cover everything to make sure things stay working and the design stays
good.&lt;/p&gt;
&lt;h2 id="linting"&gt;Linting&lt;/h2&gt;
&lt;p&gt;All code is linted with
&lt;a href="https://github.com/golangci/golangci-lint"&gt;golangci-lint&lt;/a&gt;. All code has to
lint clean, and this is enforced by our GitHub Actions. You can put exceptions
in with comments, but the exceptions should be few and well-motivated. If we
find that a particular linter run by golangci-lint is requiring lots of
exceptions, we should decide to either turn that linter off or comply with it
everywhere.&lt;/p&gt;
&lt;p&gt;All Prometheus metrics and Dockerfiles are linted too!&lt;/p&gt;
&lt;p&gt;Because we don&amp;rsquo;t have a big team and are often working under time-pressure and
&amp;ldquo;just trying to get it to work&amp;rdquo;, we want to use every linter possible.
Everything that can help us offload &amp;ldquo;smell tests&amp;rdquo; to automated machines is a
good idea.&lt;/p&gt;
&lt;h2 id="error-handling-and-recovery"&gt;Error handling and recovery&lt;/h2&gt;
&lt;p&gt;Feel free to use &lt;code&gt;rtx.Must&lt;/code&gt; and &lt;code&gt;rtx.ValueOrDie&lt;/code&gt; liberally for errors that are
genuinely unrecoverable or indicate a programming bug (e.g., parsing a
known-good embedded asset). Crashing immediately in these cases makes the
problem obvious and prevents potentially corrupted state. These helpers also
simplify testing, as you don&amp;rsquo;t need to write tests for error handling paths
that simply crash.&lt;/p&gt;
&lt;p&gt;All errors must be explicitly handled. Common handling strategies include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Crashing:&lt;/strong&gt; Using &lt;code&gt;rtx.Must&lt;/code&gt; or &lt;code&gt;rtx.ValueOrDie&lt;/code&gt; for unrecoverable errors, as described above.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Returning the error:&lt;/strong&gt; Passing the error up the call stack for a higher-level function to handle. This is the standard Go approach and is perfectly acceptable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Handling the error:&lt;/strong&gt; Implementing specific logic to recover from the error or take alternative action within the current function.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logging the error:&lt;/strong&gt; If an error cannot be handled appropriately by the current function (and returning it is not suitable, perhaps because it&amp;rsquo;s in a background goroutine or at the top level of a request handler), use &lt;code&gt;slog.Error(...)&lt;/code&gt; to record it. This increments a Prometheus counter, making unhandled operational errors visible.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Avoid logging an error and then also returning it, as this often leads to duplicate log entries. Choose one appropriate handling strategy for each error and stick to it. If you can handle the error, then it should probably produce a &lt;code&gt;Warn&lt;/code&gt; or &lt;code&gt;Info&lt;/code&gt; log message, because it&amp;rsquo;s not a true error; it&amp;rsquo;s just something unexpected. Casually logging at the Error level causes alert spam and operator fatigue.&lt;/p&gt;
&lt;h2 id="command-line-arguments"&gt;Command line arguments&lt;/h2&gt;
&lt;p&gt;We build our Docker images with ENTRYPOINT set, which means we pass in
command-line arguments in list form as part of the &lt;code&gt;command:&lt;/code&gt; directive in
the appropriate YAML file.&lt;/p&gt;
&lt;h2 id="operations"&gt;Operations&lt;/h2&gt;
&lt;p&gt;We use &lt;a href="https://about.gitlab.com/topics/gitops/"&gt;GitOps&lt;/a&gt; as our operations and
deployment strategy. When you merge a new tag to the go repo, an image with
that tag is built and pushed to Docker Hub. We use &lt;a href="https://github.com/features/actions"&gt;GitHub
Actions&lt;/a&gt; to build and push the images.&lt;/p&gt;
&lt;p&gt;Code repos run all tests and linters on every push, and build and push new
containers on every tag. Config repos push the configs to production on every
push or merge to &lt;code&gt;main&lt;/code&gt;. We acknowledge that code is configs and configs are
code, so the line is fuzzy, but the categorization in practice should be clear
in each case.&lt;/p&gt;
&lt;p&gt;Push to prod on merge to main:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;server-configs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;website&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Build+Lint on merge, push on tag:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;go&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ios_app&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;android_app&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We keep our configurations, as much as possible, in the &lt;code&gt;server-configs&lt;/code&gt; repo
and deployed as the &lt;code&gt;compose.yml&lt;/code&gt; file that is run by &lt;code&gt;docker compose&lt;/code&gt;. If we
move to k8s, then it will be a different file name, but the principle is the
same: one file to organize them all.&lt;/p&gt;
&lt;p&gt;Deployments of containers always use &lt;code&gt;--wait&lt;/code&gt; to ensure that all the containers
are healthy before moving on to the next machine. In this way, bad configs that
break containers at start only take out one machine.&lt;/p&gt;
&lt;p&gt;We roll back just like we roll forward: push and merge to main.&lt;/p&gt;
&lt;h2 id="security"&gt;Security&lt;/h2&gt;
&lt;p&gt;We use &lt;a href="https://letsencrypt.org/"&gt;Let&amp;rsquo;s Encrypt&lt;/a&gt; for SSL/TLS certificates. We
use &lt;a href="https://cert-manager.io/"&gt;cert-manager&lt;/a&gt; to manage them on each host. Each
host manages its own certificates because then we don&amp;rsquo;t have to worry about
running centralized secret provisioning. All our public services are behind
nginx, which terminates SSL/TLS and proxies requests to the appropriate internal
service.&lt;/p&gt;
&lt;p&gt;Our internal-only services are behind Tailscale. We only share internal
services between machines over our Tailscale network.&lt;/p&gt;
&lt;p&gt;In general, we prefer a &amp;ldquo;fail2slowdown&amp;rdquo; instead of an &amp;ldquo;ostrich&amp;rdquo; or &amp;ldquo;fail2ban&amp;rdquo;
strategy. We want to slow down attackers, but we also want to keep our users
happy. We want to be able to detect and respond to attacks, but we also want
to be able to detect and respond to normal traffic. Public services should be
wrapped in a handler to slow down responses to IPs that have sent us erroneous
requests. This prevents us from being overwhelmed by bad requests and prevents
Rumpelstiltskin attacks. That said, people developing against our service will
often make lots of bad requests before figuring out how to make the good ones
they want to use, so we don&amp;rsquo;t want to ban them.&lt;/p&gt;
&lt;p&gt;Each bad request from an IP causes all responses to that IP to be slower by
1 second for the next two hours, up to a max added delay of 10 seconds.&lt;/p&gt;
&lt;p&gt;We also prevent having too many simultaneous connections (good or bad) from
a single endpoint both in our nginx config, and in our Go code.&lt;/p&gt;
&lt;p&gt;We add no-index and no-cache headers to all API responses.&lt;/p&gt;
&lt;h2 id="networking"&gt;Networking&lt;/h2&gt;
&lt;p&gt;Internal services are not exposed to the public internet, and are only exposed
over &lt;a href="https://tailscale.com"&gt;Tailscale&lt;/a&gt; connections. We use Tailscale for all
internal networking between hosts. Within a host, we use
&lt;a href="https://docs.docker.com/compose/"&gt;docker-compose&lt;/a&gt; to run services and it sets
up a virtual internal network for them.&lt;/p&gt;
&lt;p&gt;We use &lt;a href="https://godaddy.com"&gt;GoDaddy&lt;/a&gt; for triplepat.com DNS (both registration and serving).
We use &lt;a href="https://cloudflare.com"&gt;Cloudflare&lt;/a&gt; for triplepat.net DNS (both registration and serving).
We have no CDN and do not want one.&lt;/p&gt;
&lt;p&gt;We have matching records for DNS on two different TLDs. We have www., @, MX,
a., b., c., and d. records for triplepat.net and triplepat.com. Those TLDs
are run by two different providers.&lt;/p&gt;
&lt;h2 id="website"&gt;Website&lt;/h2&gt;
&lt;p&gt;We use &lt;a href="https://gohugo.io/"&gt;Hugo&lt;/a&gt; to generate our website. Our website is
static. Comments are provided by linking to BlueSky posts.&lt;/p&gt;
&lt;h2 id="time"&gt;Time&lt;/h2&gt;
&lt;p&gt;Programming using time is often terrible for testing and a breeding ground for
subtle bugs. As much as possible, we use the &lt;code&gt;clocks&lt;/code&gt; package to provide a
clock that is easy to mock in tests. We use &lt;code&gt;clocks.System&lt;/code&gt; for all
time-related operations in production, and we use &lt;code&gt;clocks.Static&lt;/code&gt; or
&lt;code&gt;clocks.Blocking&lt;/code&gt; for all time-related operations in tests. We use the
&lt;code&gt;.Sleep()&lt;/code&gt;, &lt;code&gt;.Now()&lt;/code&gt;, and &lt;code&gt;.After()&lt;/code&gt; methods associated with those clocks
instead of &lt;code&gt;time.Sleep()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Ideally, &lt;code&gt;time.Now()&lt;/code&gt; and &lt;code&gt;time.Sleep()&lt;/code&gt; (and the helper methods in &lt;code&gt;time&lt;/code&gt; which
call &lt;code&gt;.now()&lt;/code&gt;, like &lt;code&gt;Until()&lt;/code&gt; and &lt;code&gt;Since()&lt;/code&gt;) should never appear in the codebase
outside of the &lt;code&gt;clocks&lt;/code&gt; package and &lt;code&gt;_test.go&lt;/code&gt; files.&lt;/p&gt;
&lt;h2 id="memoryless"&gt;Memoryless&lt;/h2&gt;
&lt;p&gt;For jobs that need to happen repeatedly, we want to prevent synchronization. The
best way of preventing synchronization is to ensure that the events occur in a
Poisson process. For that, we created &lt;code&gt;memoryless.Ticker{}&lt;/code&gt;, which has one
method, &lt;code&gt;Tick()&lt;/code&gt;, which produces a &lt;code&gt;&amp;lt;-chan time.Time&lt;/code&gt; just like &lt;code&gt;time.Tick()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Production code should use &lt;code&gt;memoryless.Ticker.Tick&lt;/code&gt; instead of &lt;code&gt;time.Tick&lt;/code&gt; or
&lt;code&gt;time.NewTicker&lt;/code&gt; or it should have a good explanation as to why.&lt;/p&gt;
&lt;h2 id="subscriptions"&gt;Subscriptions&lt;/h2&gt;
&lt;p&gt;Users log in to our services using &amp;ldquo;Log in with Apple&amp;rdquo; or &amp;ldquo;Log in with Google&amp;rdquo;.
This causes us to create an account for them if they don&amp;rsquo;t already have one.
They then get our free tier of service. In the free tier, you get 3 UUIDs, and
your UUIDs get scrubbed and deleted every month. If you want UUIDs to never
get deleted, you can subscribe to our paid plan.&lt;/p&gt;
&lt;p&gt;If you have a subscription, then you get UUIDs that are not scrubbed and are yours forever.&lt;/p&gt;
&lt;p&gt;We have a web-hook in user-service to accept notifications from the Apple App
Store about new subscriptions. That web-hook updates the state in our DB, and
it is the DB state that we will use for our scrubbing utility. The iPhone app
also reports its subscription state and supports sending the purchase receipt
to us for verification, so we try to make sure we never do harm to an existing
customer.&lt;/p&gt;
&lt;h1 id="summary"&gt;Summary&lt;/h1&gt;
&lt;p&gt;Keep it simple, respect CS theory, test everything, lint everything, and use automation everywhere. That&amp;rsquo;s how we built it.&lt;/p&gt;</description></item><item><title>100% test coverage makes your code simpler and better</title><link>https://triplepat.com/blog/2025/04/14/coverage</link><pubDate>Mon, 14 Apr 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/04/14/coverage</guid><description>&lt;p&gt;Every line of Go code that runs Triple Pat is tested by some test.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;% go test ./... -cover&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt; -count&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; | awk &lt;span style="color:#e6db74"&gt;&amp;#39;{print $5}&amp;#39;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; | sort &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; | uniq -c
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#ae81ff"&gt;41&lt;/span&gt; 100.0%
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That&amp;rsquo;s 41 packages, and every line is executed at least once in our automated test suite. Why is this good? Why is this not just a silly exercise in pedantic completionism? There are two reasons: &amp;ldquo;&lt;a href="https://hypatia.ca/2017/07/18/the-al-capone-theory-of-sexual-harassment/"&gt;the Al Capone theory of sexual harassment&lt;/a&gt;&amp;rdquo; and the value of an intellectual forcing function.&lt;/p&gt;
&lt;h2 id="software-correctness-and-the-al-capone-theory-of-sexual-harassment"&gt;Software correctness and &amp;ldquo;The Al Capone theory of sexual harassment&amp;rdquo;&lt;/h2&gt;
&lt;p&gt;Al Capone went to jail not for being a mobster, but for cheating on his taxes. His dishonesty and criminality in one area implied that he was also criminal in others. &amp;ldquo;&lt;a href="https://hypatia.ca/2017/07/18/the-al-capone-theory-of-sexual-harassment/"&gt;The Al Capone theory of sexual harassment&lt;/a&gt;&amp;rdquo; is that sexist harassers in the workplace are likely to also be dishonest and bad for the business in many ways, and should be gotten rid of ASAP. Code that is bad is also likely to be bad in many ways.&lt;/p&gt;
&lt;p&gt;The most basic test is the smoke test, where you &amp;ldquo;turn it on and see if it catches fire&amp;rdquo;. In a software context, this means running the code and verifying that it didn&amp;rsquo;t cause a crash. This is a valuable first step towards correctness, especially for greenfield development, because bad code is often bad in many ways (like Al Capone was!). Even the most basic smoke test or in-out test is enough to uncover very real problems.&lt;/p&gt;
&lt;p&gt;As code ages and corner-cases are uncovered, the tests will naturally become more complex. But to start out it&amp;rsquo;s often enough to just make sure nothing crashes by fully covering the code with the most basic of tests.&lt;/p&gt;
&lt;h2 id="tests-as-intellectual-forcing-functions"&gt;Tests as intellectual forcing functions&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s start with a quote from Tony Hoare:&lt;/p&gt;
&lt;blockquote&gt;The real value of tests is not that they detect bugs in the code, but that they detect inadequacies in the methods, concentration, and skills of those who design and produce the code.&lt;/blockquote&gt;
&lt;p&gt;When you pursue 100% coverage you design your code for testability. You also find out that many errors are actually impossible (not just unlikely — impossible). A good example is the library &lt;a href="https://pkg.go.dev/google.golang.org/api/idtoken"&gt;&lt;code&gt;google.golang.org/api/idtoken&lt;/code&gt;&lt;/a&gt;, which has &lt;a href="https://pkg.go.dev/google.golang.org/api/idtoken#NewValidator"&gt;&lt;code&gt;idtoken.NewValidator(context.Context, ...ClientOption)&lt;/code&gt;&lt;/a&gt; that returns a value and an error. If you don’t pass in any client options, then it is literally impossible for that function to return an error. Therefore, you can (and should, IMO!) use &lt;code&gt;ValueOrDie&lt;/code&gt; on its return pair if you aren’t passing in any options.&lt;/p&gt;
&lt;p&gt;I would never have known that this was the case if I hadn’t committed to 100% code coverage - I would have just done a mindless error check that returns &lt;code&gt;(nil, err)&lt;/code&gt; to the caller. Indeed, that’s the code I wrote before I started trying to test it!&lt;/p&gt;
&lt;p&gt;Because the CI coverage report showed this codepath as untested, I tried to cause the error, and I discovered that the error was impossible. This was actually the only error-producing code in the function, so as a result I could also simplify the function I was writing. This insight allowed me to change the code from:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;NewVerifier&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;context&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Context&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;clientID&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;string&lt;/span&gt;) (&lt;span style="color:#a6e22e"&gt;Verifier&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;validator&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;idtoken&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;NewValidator&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;verifier&lt;/span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;validator&lt;/span&gt;: &lt;span style="color:#a6e22e"&gt;validator&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;clientID&lt;/span&gt;: &lt;span style="color:#a6e22e"&gt;clientID&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;to:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;NewVerifier&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;context&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Context&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;clientID&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;string&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;Verifier&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// An error is impossible because&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// we&amp;#39;re not using a custom transport.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;validator&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;rtx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;ValueOrDie&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;idtoken&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;NewValidator&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;verifier&lt;/span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;validator&lt;/span&gt;: &lt;span style="color:#a6e22e"&gt;validator&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;clientID&lt;/span&gt;: &lt;span style="color:#a6e22e"&gt;clientID&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Pursuit of 100% test coverage forced me to think more deeply about my errors and how to cause them, which led to me discovering that I could make my code simpler, not just in this function but for its callers as well. The simplicity we found &amp;ldquo;cascaded upwards&amp;rdquo;! I would never have made this discovery without being forced to, however, because I am a lazy programmer just like everyone else. It was easy to write that error pass-through, so that&amp;rsquo;s what I did and that&amp;rsquo;s what most programmers would do. Without this goal of 100% coverage forcing me to think about things, I would not have thought hard enough to notice the opportunity for simplification.&lt;/p&gt;
&lt;p&gt;When you think about your code, your code becomes better. The problem is that thinking is hard and tiring, so you need a reason to do it. I recommend making 100% test coverage one of your reasons.&lt;/p&gt;
&lt;h2 id="tldr"&gt;TL;DR.&lt;/h2&gt;
&lt;p&gt;Every line of code should be executed at least once by your tests. 100% coverage doesn&amp;rsquo;t mean &amp;ldquo;no bugs&amp;rdquo; of course (and for code that needs to accomplish a task, you should definitely also have some assertions about the desired task being accomplished), but full coverage does mean that all remaining crashes are in some sense sneaky. It means they have to be joint test/code bugs, rather than just bugs in the code.&lt;/p&gt;
&lt;p&gt;Even &amp;ldquo;smoke tests&amp;rdquo; with no explicit assertions implicitly assert that the code they run doesn&amp;rsquo;t crash. 100% coverage helps keep your code free of simple bugs.&lt;/p&gt;
&lt;p&gt;When you try to make sure every line at least gets a smoke test, you also end up finding that some errors are actually impossible. Impossible errors represent code that you don&amp;rsquo;t need to write, and complication you can avoid. 100% coverage helps keep your code simple.&lt;/p&gt;
&lt;p&gt;100% coverage helps keep code simple and free of simple bugs. Ultimately, the goal is to write correct code, and as Tony Hoare also said:&lt;/p&gt;
&lt;blockquote&gt;There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.
&lt;/blockquote&gt;
&lt;p&gt;In our twin pursuits of simplicity and correctness, full test coverage plays a valuable role.&lt;/p&gt;</description></item><item><title>Learning in public from our incidents</title><link>https://triplepat.com/blog/2025/03/31/incidents</link><pubDate>Mon, 31 Mar 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/03/31/incidents</guid><description>&lt;p&gt;Triple Pat runs a service on computers and offers that service to the world. Because computers only mostly work, but we want our service to work all the time, we have to assemble a reliable service out of mostly-reliable components. A critical part of doing that is a post-mortem process where past incidents are analyzed and steps taken to avoid them recurring in the future, so that&amp;rsquo;s what we do!&lt;/p&gt;
&lt;p&gt;Triple Pat runs its post-mortem process largely in the open. We have a monitoring process (run by &lt;a href="https://betterstack.com"&gt;Better Stack&lt;/a&gt;) and they host a public status page for us of the server statuses they measure. The status page can be found at &lt;a href="https://status.triplepat.com"&gt;status.triplepat.com&lt;/a&gt;. On that main status page, you can see past incidents where some servers weren&amp;rsquo;t behaving as well as we would have liked.&lt;/p&gt;
&lt;p&gt;If you would like to learn more about each incident, you can check out our page of &lt;a href="https://status.triplepat.com/en/incidents"&gt;incident reports&lt;/a&gt;! Every outage is categorized and described. Some are just acknowledged, but most end up causing changes in how we deploy and run our systems.&lt;/p&gt;
&lt;p&gt;From the status page, you can see that not only is our service reliable (we operate in master-master mode, so an outage requires every server to be down simultaneously, and that has never happened), but also that as things happen to our service components, we learn and take action to prevent repeats.&lt;/p&gt;
&lt;p&gt;The errors cause both immediate fixes (like rolling back to a previously-working configuration) and long-term preventions (like adding validation steps to prevent the rollout of broken configs). Reliability comes from good decisions, good decisions come from experience, and experience comes from learning from bad decisions. You can watch along as our organization learns more about where our systems tend to break and what we can do to prevent and/or automatically mitigate that breakage.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;re proud of our current 100% uptime record for the Check-In service, but we&amp;rsquo;re even more proud that we&amp;rsquo;re constantly learning and improving. Doing this helps ensure that our uptime isn&amp;rsquo;t just luck, it&amp;rsquo;s the result of systematic learning from past incidents and continuous investment in increased resilience.&lt;/p&gt;</description></item><item><title>Linting your Go code</title><link>https://triplepat.com/blog/2025/03/26/linting</link><pubDate>Wed, 26 Mar 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/03/26/linting</guid><description>&lt;p&gt;You should lint your code! Pretty much everyone agrees that you should do this. Linters are tools of varying degrees of sophistication that attempt to help you avoid dangerous situations with your code, and coding is hard enough that such automated help is invaluable.&lt;/p&gt;
&lt;p&gt;The Go programming language was designed to have a relatively simple syntax and to compile quickly, which means it is also straightforward to write a linter for the language. So lots of people did! Eventually their efforts got collected in a meta-linter called &lt;a href="https://golangci-lint.run/"&gt;golangci-lint&lt;/a&gt;. That linter just had version 2 released, so it&amp;rsquo;s a good time to talk about its configs.&lt;/p&gt;
&lt;p&gt;Here are &lt;a href="https://gist.github.com/pboothe/7babfa2c89aaa00859c3dcf8f62bce8d"&gt;the configs for Triple Pat&amp;rsquo;s Go code&lt;/a&gt;. The config is actually a pretty strict one, and I encourage you to be strict when setting up a new project. Clean linting is hard to retrofit onto an existing project, but easy to adopt at the beginning. Start your projects off right!&lt;/p&gt;
&lt;p&gt;How does our config differ from yours? What should we do differently?&lt;/p&gt;</description></item><item><title>Reliable DNS</title><link>https://triplepat.com/blog/2025/03/04/dns</link><pubDate>Tue, 04 Mar 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/03/04/dns</guid><description>&lt;p&gt;The Triple Pat check-in service is a service centered around helping people ensure that their
software operations alerts are working. This means that we are measuring the reliability of
something that is, itself, very reliable. Just like you need a lot of precision to measure
something that is very precisely made, you need a lot of reliability to measure a reliable
system. Every use of our APIs uses the names of our servers, which means that our system
is only as reliable as our names are, and our names are provided by DNS.&lt;/p&gt;
&lt;p&gt;The Domain Name System (DNS) seems simple in concept: a database mapping names to IP addresses.
Unfortunately, &lt;a href="https://www.karlton.org/2017/12/naming-things-hard/"&gt;there are only two hard things in
computer science: cache invalidation and naming things&lt;/a&gt;.
And DNS is a distributed cached database of names.&lt;/p&gt;
&lt;p&gt;This means that, despite superb efforts all around, DNS doesn&amp;rsquo;t work right sometimes. It&amp;rsquo;s always
in some sense operator error, but those errors happen often enough that we can also blame the
tool for providing a gun and aiming it at the operator&amp;rsquo;s foot. DNS is so often the culprit that
we even have websites like &lt;a href="https://isitdns.com"&gt;isitdns.com&lt;/a&gt; and &lt;a href="https://dnshaiku.com"&gt;dnshaiku.com&lt;/a&gt;. However, at Triple
Pat we want to build a reliable system that provides an external API. That API can be accessed in
two main ways: by email and by URL. Both methods depend on DNS. So how do we set up our DNS for
reliability?&lt;/p&gt;
&lt;p&gt;Well, first we have reliability through redundancy. Everything available through &lt;a href="https://triplepat.com"&gt;triplepat.com&lt;/a&gt;
is also available through &lt;a href="https://triplepat.net"&gt;triplepat.net&lt;/a&gt;. We have mirrored the records
so that if DNS seems down for one, try the other and it should hopefully be up. We enforce
the independence of these by buying the names and hosting the DNS through rival companies.
GoDaddy is in charge of registration and DNS operations for triplepat.com, and Cloudflare is
in charge of registration and DNS operations for triplepat.net. These two companies are direct
competitors, and as such they are unlikely to share infrastructure. So we have achieved
independent redundant reliability via incentive-compatibility with large corporations.&lt;/p&gt;
&lt;p&gt;Next, for our mail operations to &lt;code&gt;uuid@checkin.triplepat.com&lt;/code&gt; and &lt;code&gt;uuid@checkin.triplepat.net&lt;/code&gt;,
we have MX-type DNS records specifying the mail servers for both &lt;code&gt;checkin.triplepat.net&lt;/code&gt; and &lt;code&gt;checkin.triplepat.com&lt;/code&gt;. Because we can
receive email at any of our mirrors, we make the MX record for each TLD refer to every one of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://a.triplepat.com"&gt;a.triplepat.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://a.triplepat.net"&gt;a.triplepat.net&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://b.triplepat.com"&gt;b.triplepat.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://b.triplepat.net"&gt;b.triplepat.net&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://c.triplepat.com"&gt;c.triplepat.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://c.triplepat.net"&gt;c.triplepat.net&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://d.triplepat.com"&gt;d.triplepat.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://d.triplepat.net"&gt;d.triplepat.net&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and we have made the time-to-live (TTL) for the MX records be as long as each provider allows. This means that
email service is highly likely to keep working, even if one TLD goes down, because the long-lived
records mean that mail for one can be handled by the other, and that fact can be served out of
cache. So, once you have emails flowing, it is highly likely that they will be able to continue
to flow even through DNS disruptions.&lt;/p&gt;
&lt;p&gt;So we achieve DNS reliability by having multiple providers, and, for the MX (mail) records, by
having long-lived cache entries that allow each TLD to back up the other (or not). To see
this in action for our systems, use the &lt;code&gt;dig&lt;/code&gt; tool!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-txt" data-lang="txt"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;$ dig checkin.triplepat.net MX
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;;; ANSWER SECTION:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.net. 86400 IN MX 10 a.triplepat.net.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.net. 86400 IN MX 10 c.triplepat.com.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.net. 86400 IN MX 10 b.triplepat.com.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.net. 86400 IN MX 10 a.triplepat.com.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.net. 86400 IN MX 10 b.triplepat.net.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.net. 86400 IN MX 10 d.triplepat.net.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.net. 86400 IN MX 10 d.triplepat.com.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.net. 86400 IN MX 10 c.triplepat.net.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;and&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-txt" data-lang="txt"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;$ dig checkin.triplepat.com MX
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;...
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;;; ANSWER SECTION:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.com. 86400 IN MX 10 a.triplepat.com.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.com. 86400 IN MX 10 c.triplepat.com.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.com. 86400 IN MX 10 d.triplepat.com.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.com. 86400 IN MX 10 a.triplepat.net.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.com. 86400 IN MX 10 b.triplepat.com.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.com. 86400 IN MX 10 b.triplepat.net.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.com. 86400 IN MX 10 d.triplepat.net.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;checkin.triplepat.com. 86400 IN MX 10 c.triplepat.net.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;From this we can see both sets of names in both responses. Our
DNS entries for a, b, c, and d are shorter-lived because we want
to be able to migrate servers in less than a day, but the
MX record means that the mail sender can fail over between
TLDs transparently to you, the service user.&lt;/p&gt;
&lt;p&gt;No matter which TLD you use for the email, as long as the MX
record is in the cache (and to get in the cache, it needs to
have worked once in the past 24 hours), your email will get
routed to either of the two TLDs.&lt;/p&gt;
&lt;p&gt;Even better, it is highly likely that at least one TLD
is functioning properly, because the providers are independent.
This means that if one of them is functioning correctly, the service should
still work. If we find out that two are not enough, then we
can easily add a third TLD, but for now we feel pretty confident
that two independent DNS registrars/providers are enough to get
us the &lt;a href="https://en.wikipedia.org/wiki/Nines_%28notation%29#System_availability"&gt;nines&lt;/a&gt; we need to provide a reliable service.&lt;/p&gt;</description></item><item><title>Highly reliable Go code - use clocks not time</title><link>https://triplepat.com/blog/2025/02/28/clocks</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/02/28/clocks</guid><description>&lt;p&gt;Q: How can you tell if someone is a programmer?&lt;/p&gt;
&lt;p&gt;A: Easy! Say &amp;ldquo;time zones&amp;rdquo; and see if they flinch!&lt;/p&gt;
&lt;p&gt;Time and clocks always sound straightforward when you don&amp;rsquo;t think too hard about them.
Unfortunately for us, the job of a programmer is to think hard about the things they are building.
And one of the hardest things to get right is testing code that depends on time.&lt;/p&gt;
&lt;p&gt;Imagine you have a cache that expires items after an hour, or a retry mechanism that waits
30 seconds between attempts. How do you test these without waiting? Your unit tests would
take forever to run!&lt;/p&gt;
&lt;p&gt;When working with Go, you often see &lt;code&gt;time.Sleep()&lt;/code&gt; and &lt;code&gt;time.Now()&lt;/code&gt;.
In C the equivalents might be &lt;code&gt;sleep()&lt;/code&gt; or &lt;code&gt;usleep()&lt;/code&gt; and &lt;code&gt;time()&lt;/code&gt;.
In C++ you should probably use &lt;code&gt;std::this_thread::sleep_for()&lt;/code&gt; and &lt;code&gt;std::chrono::system_clock::now()&lt;/code&gt;.
In Python it&amp;rsquo;s &lt;code&gt;datetime.datetime.now()&lt;/code&gt; and &lt;code&gt;time.sleep()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In all of these functions, you consult data derived from the OS-managed hardware clock. Because that clock is
managed outside of your code, you don&amp;rsquo;t control it! Worse, you are up a creek when it comes to testing.
I would like to propose that, instead of using these global functions, you define an interface called &lt;code&gt;Clock&lt;/code&gt; and
provide at least two implementations: &lt;code&gt;SystemClock&lt;/code&gt; and &lt;code&gt;StaticClock&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In Go this would be:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-golang" data-lang="golang"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;package&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;clocks&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;type&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Clock&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;interface&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Now&lt;/span&gt;() &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Sleep&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;type&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;SystemClock&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt;{}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;SystemClock&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;Now&lt;/span&gt;() &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt; { &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Now&lt;/span&gt;() }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;SystemClock&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;Sleep&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;d&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt;) { &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Sleep&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;d&lt;/span&gt;) }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;type&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;StaticClock&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;sc&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;StaticClock&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;Now&lt;/span&gt;() &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt; { &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;sc&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;sc&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;StaticClock&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;Sleep&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;d&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt;) { &lt;span style="color:#a6e22e"&gt;sc&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt; &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;sc&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Add&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;d&lt;/span&gt;) }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With these two implementations, you can embed the clock in the code you want to test,
which means that you can make that code use the static clock! All of its sleeps will be
instantaneous - you can verify using &lt;code&gt;Now()&lt;/code&gt; that time has passed, but you will never have
to wait for a &lt;code&gt;Sleep()&lt;/code&gt; to complete. They have been transformed into additions!&lt;/p&gt;
&lt;p&gt;To use this, you&amp;rsquo;ll need to give each object that uses time another
field - the clock. Then at test time you can pass in a StaticClock and at runtime you
can pass in the SystemClock. By testing with clock objects, you can make your unit tests
both faster and significantly less flaky.&lt;/p&gt;
&lt;p&gt;Just for completeness, here are the clocks in Python:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; datetime&lt;span style="color:#f92672"&gt;,&lt;/span&gt; time
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;class&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;SystemClock&lt;/span&gt;(object):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;now&lt;/span&gt;(self) &lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt; datetime&lt;span style="color:#f92672"&gt;.&lt;/span&gt;datetime:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; datetime&lt;span style="color:#f92672"&gt;.&lt;/span&gt;datetime&lt;span style="color:#f92672"&gt;.&lt;/span&gt;now()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;sleep&lt;/span&gt;(self, seconds: float):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; time&lt;span style="color:#f92672"&gt;.&lt;/span&gt;sleep(seconds)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;class&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;StaticClock&lt;/span&gt;(object):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;__init__&lt;/span&gt;(self, currentTime: datetime&lt;span style="color:#f92672"&gt;.&lt;/span&gt;datetime):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;currentTime &lt;span style="color:#f92672"&gt;=&lt;/span&gt; currentTime
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;now&lt;/span&gt;(self) &lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt; datetime&lt;span style="color:#f92672"&gt;.&lt;/span&gt;datetime:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;currentTime
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;def&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;sleep&lt;/span&gt;(self, seconds: float):
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; self&lt;span style="color:#f92672"&gt;.&lt;/span&gt;currentTime &lt;span style="color:#f92672"&gt;+=&lt;/span&gt; datetime&lt;span style="color:#f92672"&gt;.&lt;/span&gt;timedelta(seconds&lt;span style="color:#f92672"&gt;=&lt;/span&gt;seconds)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Using these clocks, you can decide whether you want the OS to be in control of your code&amp;rsquo;s
perception of time, or whether you want to be in control. I find that in unit tests I
almost always want to be in control, while in production I want to hand control to the OS.&lt;/p&gt;
&lt;p&gt;This is called &amp;ldquo;dependency injection&amp;rdquo; when generalized, and can be of great help in many
contexts. In my opinion, clocks are one of the very best dependencies to inject.&lt;/p&gt;
&lt;h2 id="a-practical-example"&gt;A Practical Example&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s look at a real-world example. Suppose you have a cache that needs to expire items after a certain duration:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-golang" data-lang="golang"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;type&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Cache&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Clock&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;items&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;map&lt;/span&gt;[&lt;span style="color:#66d9ef"&gt;string&lt;/span&gt;]&lt;span style="color:#a6e22e"&gt;item&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;type&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;item&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;value&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;expiresAt&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;NewCache&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Clock&lt;/span&gt;) &lt;span style="color:#f92672"&gt;*&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;Cache&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;Cache&lt;/span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt;: &lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;items&lt;/span&gt;: make(&lt;span style="color:#66d9ef"&gt;map&lt;/span&gt;[&lt;span style="color:#66d9ef"&gt;string&lt;/span&gt;]&lt;span style="color:#a6e22e"&gt;item&lt;/span&gt;),
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;c&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;Cache&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;Set&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;key&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;string&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;value&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;string&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;ttl&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Duration&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;c&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;items&lt;/span&gt;[&lt;span style="color:#a6e22e"&gt;key&lt;/span&gt;] = &lt;span style="color:#a6e22e"&gt;item&lt;/span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;value&lt;/span&gt;: &lt;span style="color:#a6e22e"&gt;value&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;expiresAt&lt;/span&gt;: &lt;span style="color:#a6e22e"&gt;c&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Now&lt;/span&gt;().&lt;span style="color:#a6e22e"&gt;Add&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ttl&lt;/span&gt;),
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;c&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;Cache&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;Get&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;key&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;string&lt;/span&gt;) (&lt;span style="color:#66d9ef"&gt;string&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;bool&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;item&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;exists&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;c&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;items&lt;/span&gt;[&lt;span style="color:#a6e22e"&gt;key&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; !&lt;span style="color:#a6e22e"&gt;exists&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&amp;#34;&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;false&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;c&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Now&lt;/span&gt;().&lt;span style="color:#a6e22e"&gt;After&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;item&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;expiresAt&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;&amp;#34;&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;false&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;item&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;value&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now testing this cache becomes trivial:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-golang" data-lang="golang"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;TestCacheExpiration&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;t&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;testing&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;T&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;StaticClock&lt;/span&gt;{&lt;span style="color:#a6e22e"&gt;Time&lt;/span&gt;: &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Now&lt;/span&gt;()}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;cache&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;NewCache&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;cache&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Set&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;key&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;value&amp;#34;&lt;/span&gt;, &lt;span style="color:#ae81ff"&gt;5&lt;/span&gt;&lt;span style="color:#f92672"&gt;*&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Minute&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// Advance time by 4 minutes&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Sleep&lt;/span&gt;(&lt;span style="color:#ae81ff"&gt;4&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Minute&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;value&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;exists&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;cache&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Get&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;key&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; !&lt;span style="color:#a6e22e"&gt;exists&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;Item should still exist&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// Advance time by 2 more minutes&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;clock&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Sleep&lt;/span&gt;(&lt;span style="color:#ae81ff"&gt;2&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;time&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Minute&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;_&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;exists&lt;/span&gt; = &lt;span style="color:#a6e22e"&gt;cache&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Get&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;key&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;exists&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;t&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;Item should have expired&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This test runs instantly, is deterministic, and clearly demonstrates the cache&amp;rsquo;s behavior. Without StaticClock, this test would take 6 minutes!&lt;/p&gt;
&lt;p&gt;In production, you would use:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-golang" data-lang="golang"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;cache&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;NewCache&lt;/span&gt;(&lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;SystemClock&lt;/span&gt;{})
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="key-benefits"&gt;Key Benefits&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Testability&lt;/strong&gt;: Tests run instantly and are deterministic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Control&lt;/strong&gt;: You decide when and how time advances&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Your code is isolated from system clock changes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clarity&lt;/strong&gt;: Dependencies on time are explicit in your interfaces&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="use-it"&gt;Use it!&lt;/h2&gt;
&lt;p&gt;This pattern is your friend whenever you need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Test code with timeouts or delays&lt;/li&gt;
&lt;li&gt;Verify behavior at specific times or dates&lt;/li&gt;
&lt;li&gt;Make flaky time-dependent tests reliable&lt;/li&gt;
&lt;li&gt;Speed up tests that would otherwise need to wait&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is not a new trick, but it is very useful and under-used, and it can
make your life much better when you need to test code that relies on time.&lt;/p&gt;
&lt;p&gt;All code in this blog post is &lt;a href="https://creativecommons.org/public-domain/cc0/"&gt;CC0&lt;/a&gt;, which means you can freely use it however you want.&lt;/p&gt;
&lt;p&gt;Happy coding!&lt;/p&gt;</description></item><item><title>Reliable Systems via Good Metrics</title><link>https://triplepat.com/blog/2025/02/14/good-metrics</link><pubDate>Fri, 14 Feb 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/02/14/good-metrics</guid><description>&lt;p&gt;When building and running reliable systems, you need data to answer three critical questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Is the system working correctly now?&lt;/li&gt;
&lt;li&gt;How well has it been working over time?&lt;/li&gt;
&lt;li&gt;When did things start to go wrong?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Metrics help us answer these questions. Here we focus on Prometheus metrics and some best practices for them.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; is a pull-based system for specifying, collecting, and saving metrics. It&amp;rsquo;s pretty easy to set up and very powerful. There are, however, some best practices around metrics and metric names that are easy to get wrong. Worse, changing metric names is surprisingly difficult, because metrics get used in alerts and dashboards, so a rename can easily break &amp;ldquo;downstream&amp;rdquo; systems. It is worth it to try and get your metrics right the first time.&lt;/p&gt;
&lt;p&gt;A subtle question that people only start asking themselves after using a metrics system for a bit is &amp;ldquo;What makes a good metric?&amp;rdquo; Metrics usually have names and values and types, and there are definitely better and worse ways to name, define, and use metrics in your code. As a quick example, if a metric tracks a duration, its name should end with &lt;code&gt;_seconds&lt;/code&gt;, and not &lt;code&gt;_ms&lt;/code&gt; or &lt;code&gt;_s&lt;/code&gt; or &lt;code&gt;_time&lt;/code&gt;. More on this later, but these kinds of rules are important, and are easy to get wrong.&lt;/p&gt;
&lt;p&gt;We can describe some best practices around metrics (and we will!), but manually reviewing code against them is error-prone. When possible, use a linter to check your metrics. This post describes how to use the &lt;code&gt;promlint&lt;/code&gt; tool to check your metrics, and why you might really want to use it. We end with some metrics best practices, including some that &lt;code&gt;promlint&lt;/code&gt; unfortunately can&amp;rsquo;t check but that are good to keep in mind, and then exhort you to use &lt;code&gt;promlint&lt;/code&gt; as part of your unit tests.&lt;/p&gt;
&lt;h2 id="promlint-for-linting-metrics"&gt;&lt;code&gt;promlint&lt;/code&gt; for linting metrics&lt;/h2&gt;
&lt;p&gt;Use &lt;a href="https://pkg.go.dev/github.com/prometheus/client_golang/prometheus/testutil/promlint"&gt;&lt;code&gt;promlint&lt;/code&gt;&lt;/a&gt; to check your metrics. It&amp;rsquo;s great. When it complains, it is right and you should fix your metrics. They built a tool with a very low false-positive rate, so you can trust it.&lt;/p&gt;
&lt;h3 id="how-to-use-promlint"&gt;How to use &lt;code&gt;promlint&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The Go library &lt;code&gt;promlint&lt;/code&gt; contains two methods of note: &lt;code&gt;New()&lt;/code&gt; and &lt;code&gt;Lint()&lt;/code&gt;. Because you want to check all the metrics for your application, you should lint your metrics in a unit test inside the &lt;code&gt;main&lt;/code&gt; package so that all the libraries &lt;code&gt;main()&lt;/code&gt; uses are included. &lt;a href="https://gist.github.com/pboothe/349f0d165139447826f5e0cedbfdebfc"&gt;Here&amp;rsquo;s how I use it to check my metrics&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now you can test your own metrics, and if you violate the automatically-checkable best practices you will get an error. So let&amp;rsquo;s look at some of the best practices, both automatically-checkable and not.&lt;/p&gt;
&lt;h2 id="good-metric-names"&gt;Good metric names&lt;/h2&gt;
&lt;p&gt;A good metric name leaves no doubt about what the metric measures and where it comes from. Right away from this definition of good we get some rules:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A metric should be clear about where in the code it comes from.&lt;/li&gt;
&lt;li&gt;A metric name should specify the type of metric.&lt;/li&gt;
&lt;li&gt;A metric should be clear about what it measures.&lt;/li&gt;
&lt;li&gt;A metric name should include the unit of measurement.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Pleasantly, the Prometheus project &lt;a href="https://prometheus.io/docs/practices/naming/"&gt;agrees with these rules about naming metrics&lt;/a&gt;. This means that &lt;code&gt;promlint&lt;/code&gt; can check your metrics for compliance with those rules! As a quick precis, here are the rules:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Counter metric names (and only counter names) should end with &lt;code&gt;_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;All metric names should end with a unit of measurement (just before the &lt;code&gt;_total&lt;/code&gt; for counters)&lt;/li&gt;
&lt;li&gt;Units of measurement should be an SI unit without a multiplier. Use &lt;code&gt;_seconds&lt;/code&gt; for seconds and &lt;code&gt;_bytes&lt;/code&gt; for bytes, and avoid &lt;code&gt;_milliseconds&lt;/code&gt;, &lt;code&gt;_ms&lt;/code&gt;, &lt;code&gt;_us&lt;/code&gt;, &lt;code&gt;_ns&lt;/code&gt;, &lt;code&gt;_µs&lt;/code&gt;, &lt;code&gt;_mB&lt;/code&gt;, &lt;code&gt;_megabytes&lt;/code&gt;, and &lt;code&gt;_kb&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The linked page has more, but these are the important ones. &lt;code&gt;promlint&lt;/code&gt; will check your metrics for compliance with these rules as much as possible.&lt;/p&gt;
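&lt;p&gt;To make the three rules concrete, here is a toy checker written for this post. It is a sketch of the idea only: it is not how &lt;code&gt;promlint&lt;/code&gt; is implemented, the names &lt;code&gt;checkName&lt;/code&gt; and &lt;code&gt;badSuffixes&lt;/code&gt; are made up, and the suffix list is deliberately incomplete. Use the real linter for real work.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// badSuffixes is a hypothetical, incomplete list of unit suffixes to avoid.
var badSuffixes = []string{"_milliseconds", "_ms", "_us", "_ns", "_megabytes", "_kb"}

// checkName returns complaints about a metric name.
// isCounter says whether the metric being named is a counter.
func checkName(name string, isCounter bool) []string {
	var problems []string
	hasTotal := strings.HasSuffix(name, "_total")
	if isCounter {
		if !hasTotal {
			problems = append(problems, "counter names should end with _total")
		}
	} else if hasTotal {
		problems = append(problems, "only counter names should end with _total")
	}
	// The unit sits just before _total, so strip _total before checking it.
	base := strings.TrimSuffix(name, "_total")
	for _, s := range badSuffixes {
		if strings.HasSuffix(base, s) {
			problems = append(problems, "use a unit without a multiplier instead of "+s)
		}
	}
	return problems
}

func main() {
	fmt.Println(checkName("http_requests_total", true))
	fmt.Println(checkName("http_request_duration_ms", false))
	fmt.Println(checkName("cache_size_bytes_total", false))
}
```

&lt;p&gt;The first name passes, the second is flagged for its unit, and the third is flagged for using &lt;code&gt;_total&lt;/code&gt; on something that is not a counter.&lt;/p&gt;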
&lt;h2 id="good-metric-types"&gt;Good metric types&lt;/h2&gt;
&lt;p&gt;There are four types of metrics: counters, gauges, summaries, and histograms. There are a few more options, but they are more obscure and are easy to get wrong. You should stick to gauges, counters, and histograms, and really you should try to stick to just counters and histograms.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Counters for metrics that are monotonically increasing.&lt;/li&gt;
&lt;li&gt;Histograms for metrics that need to track a distribution of values.&lt;/li&gt;
&lt;li&gt;Summaries are like histograms, but they are more complex and have more pitfalls. Prefer the simplicity of histograms.&lt;/li&gt;
&lt;li&gt;Gauges for metrics that can go up and down. Use sparingly because they have many subtle pitfalls.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="counters"&gt;Counters&lt;/h3&gt;
&lt;p&gt;A critical thing to understand about metrics is that they are collected &lt;em&gt;when the collector wants&lt;/em&gt;. Your code has no influence on how often the Prometheus collector comes around to collect metrics. This means you need to design your metrics to be informative even when sampled at very different rates. The most common way to do this is to basically always use counters. Increments of a counter are never missed; the collection rate only determines how coarse the sampling is. With counters you will never miss a measurement.&lt;/p&gt;
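&lt;p&gt;Here is a small simulation of that claim, using only the standard library. The &lt;code&gt;scrape&lt;/code&gt; function and the workload are invented for illustration and have nothing to do with the real Prometheus client; the point is only that a counter read rarely and a counter read often agree on the total.&lt;/p&gt;

```go
package main

import "fmt"

// scrape simulates a collector that reads a counter every interval ticks.
// perTick[i] is the number of events that happened during tick i.
// It returns the counter values the collector observed.
func scrape(perTick []int, interval int) []int {
	var counter int   // monotonically increasing, like a Prometheus counter
	var samples []int // what the collector sees
	for i, n := range perTick {
		counter += n
		if (i+1)%interval == 0 {
			samples = append(samples, counter)
		}
	}
	return samples
}

func main() {
	// Events per tick; the burst of 50 in the third tick is made up.
	perTick := []int{0, 0, 50, 0, 1, 0, 0, 0, 3, 0, 0, 6}
	for _, interval := range []int{1, 4, 12} {
		fmt.Printf("every %2d ticks: %v\n", interval, scrape(perTick, interval))
	}
}
```

&lt;p&gt;However rarely the collector visits, its last sample is 60 and the burst of 50 is included in it. Only the resolution changes.&lt;/p&gt;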
&lt;h3 id="histograms"&gt;Histograms&lt;/h3&gt;
&lt;p&gt;Histograms are actually counters under the hood, so everything good about counters is also good about histograms. Histograms never miss a measurement, and the collection rate just determines how deep the metrics user can zoom into the data. If they sample every minute, then they can zoom in and see changes every minute. If they sample every hour, then they can&amp;rsquo;t zoom in as much. Good histogram names are clear about what they measure, and include the unit of measurement, for example &lt;code&gt;http_request_duration_seconds&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re measuring HTTP request durations with &lt;code&gt;http_request_duration_seconds&lt;/code&gt;, the histogram will create multiple counters like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;http_request_duration_seconds_bucket{le=&amp;quot;0.1&amp;quot;}&lt;/code&gt; (total requests that took 0.1s or less)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http_request_duration_seconds_bucket{le=&amp;quot;0.5&amp;quot;}&lt;/code&gt; (total requests that took 0.5s or less)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http_request_duration_seconds_sum&lt;/code&gt; (total duration of all requests)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http_request_duration_seconds_count&lt;/code&gt; (total number of requests)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now you can see how histograms are a cousin of counters, because they literally are a collection of counters, and everything good about counters is also good about histograms.&lt;/p&gt;
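&lt;p&gt;A toy model makes the &amp;ldquo;collection of counters&amp;rdquo; idea visible. This sketch is not the Prometheus client library; the &lt;code&gt;histogram&lt;/code&gt; type, its fields, and the sample durations are all assumptions made for this example.&lt;/p&gt;

```go
package main

import "fmt"

// histogram is a toy model of a Prometheus histogram: nothing but counters.
type histogram struct {
	bounds  []float64 // upper bounds (the "le" labels), ascending
	buckets []uint64  // buckets[i] counts observations at or below bounds[i]
	sum     float64   // the _sum series
	count   uint64    // the _count series
}

func newHistogram(bounds ...float64) *histogram {
	h := new(histogram)
	h.bounds = bounds
	h.buckets = make([]uint64, len(bounds))
	return h
}

// observe records one measurement. Every field it touches only ever goes up.
func (h *histogram) observe(v float64) {
	for i, b := range h.bounds {
		if b >= v {
			h.buckets[i]++ // buckets are cumulative, so several may increment
		}
	}
	h.sum += v
	h.count++
}

func main() {
	h := newHistogram(0.1, 0.5)
	for _, d := range []float64{0.05, 0.2, 0.3, 2.0} {
		h.observe(d)
	}
	for i, b := range h.bounds {
		fmt.Printf("http_request_duration_seconds_bucket{le=\"%g\"} %d\n", b, h.buckets[i])
	}
	fmt.Printf("http_request_duration_seconds_sum %g\n", h.sum)
	fmt.Printf("http_request_duration_seconds_count %d\n", h.count)
}
```

&lt;p&gt;With these four observations the 0.1 bucket holds 1, the 0.5 bucket holds 3, and the count is 4. The 2.0 second request lands in no finite bucket but still shows up in the sum and the count.&lt;/p&gt;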
&lt;h3 id="summaries"&gt;Summaries&lt;/h3&gt;
&lt;p&gt;Summary metrics are a cousin of histograms, but they are more clever internally. They attempt to adaptively track the median value of a metric (or the 95th percentile, or some other percentile), but the algorithm for reporting a median is either memory intensive (if they save all measurements and report the exact median at scrape time) or imprecise (if they use a streaming algorithm). The mathematical subtleties are beyond the scope of this post, but the bottom line is that you should strongly prefer histograms. The simplicity of histograms is worth a lot.&lt;/p&gt;
&lt;h3 id="gauges"&gt;Gauges&lt;/h3&gt;
&lt;p&gt;Gauges are the most problematic type of metric, because there are no rules and therefore no guarantees about gauge values. Gauges have no history, only a current value. A gauge that spikes for 1 second every minute at 15 seconds after the minute will either look like it is constantly zero or constantly spiking when it is collected every minute. Worse, whether it looks like zero or a spike depends on when the &lt;em&gt;collector&lt;/em&gt; started! Gauges are best used for information that changes very slowly. I use gauges to track the git commit that the server binary was built from, because that value changes very slowly (it only changes when I build and deploy a new binary).&lt;/p&gt;
&lt;p&gt;Also, you want to be careful to try and use metrics that aren&amp;rsquo;t ratios. Export numerators and denominators separately. Don&amp;rsquo;t export a gauge of &lt;code&gt;http_requests_per_minute&lt;/code&gt;, export &lt;code&gt;http_requests_total&lt;/code&gt; and let the user do the division. This gives the user more flexibility (turning requests per minute into requests per hour or per second seems easy but is not) and avoids issues when the denominator is zero. In general if the word &amp;ldquo;&lt;code&gt;per&lt;/code&gt;&amp;rdquo; is in the metric name, you are probably doing something wrong, especially if the denominator is a time unit. Export a counter or histogram for the numerator, and then the user can do the rest.&lt;/p&gt;
&lt;p&gt;If you want to keep track of a current allocation, it may be tempting to use a gauge. But this ends up having the sampling problem described earlier. Better is to use two counters, one for allocations and one for deallocations. The difference between the two counters is the current allocation. This way you don&amp;rsquo;t have the sampling problem, and you don&amp;rsquo;t have to worry about missing massive temporary spikes in allocations. Count packets and bytes sent and received instead of reporting bps. Instead of measuring how much water is in a bucket, measure how much water has been added to the bucket and how much has been taken out.&lt;/p&gt;
&lt;p&gt;For example, instead of a gauge like &lt;code&gt;memory_usage_bytes&lt;/code&gt;, consider using two counters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;memory_allocations_bytes_total&lt;/code&gt; (total memory allocated)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;memory_deallocations_bytes_total&lt;/code&gt; (total memory freed)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This gives you both the current usage (by subtracting) and the historical allocation/deallocation volumes between samples, which a gauge would miss.&lt;/p&gt;
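&lt;p&gt;The difference is easy to see in a simulation. The sketch below uses only the standard library; the &lt;code&gt;sample&lt;/code&gt; type, the &lt;code&gt;simulate&lt;/code&gt; function, and the workload (a 5000 byte spike that is allocated and freed between two scrapes) are all hypothetical.&lt;/p&gt;

```go
package main

import "fmt"

// sample is what a collector would see at one scrape.
type sample struct {
	gauge     int // current usage, as a gauge would report it
	allocated int // memory_allocations_bytes_total
	freed     int // memory_deallocations_bytes_total
}

// simulate applies one change in usage per tick (positive allocates,
// negative frees) and records a sample every interval ticks.
func simulate(deltas []int, interval int) []sample {
	var cur sample
	var out []sample
	for i, d := range deltas {
		if d > 0 {
			cur.allocated += d
		} else {
			cur.freed -= d
		}
		cur.gauge += d
		if (i+1)%interval == 0 {
			out = append(out, cur)
		}
	}
	return out
}

func main() {
	// A large spike is allocated in the second tick and freed in the third.
	deltas := []int{100, 5000, -5000, 20, -10, 0}
	for _, s := range simulate(deltas, 3) {
		fmt.Printf("gauge=%d allocated_total=%d freed_total=%d\n", s.gauge, s.allocated, s.freed)
	}
}
```

&lt;p&gt;Scraped every three ticks, the gauge reads 100 and then 110, so the spike is invisible. The counters read 5100 allocated and 5000 freed at the first scrape, which both reveals the spike and still yields the current usage by subtraction.&lt;/p&gt;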
&lt;p&gt;By keeping your metrics simple and sample-rate agnostic, you and your metric&amp;rsquo;s users can change the granularity or adapt them to new situations. This is a huge win. I&amp;rsquo;ve lost weeks of my professional life to bad metrics in the past, and I&amp;rsquo;d like to help you avoid that.&lt;/p&gt;
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;Metrics are important and good!&lt;/p&gt;
&lt;p&gt;Tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;promlint&lt;/code&gt; to catch common metric mistakes&lt;/li&gt;
&lt;li&gt;Run metric checks as part of your unit tests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Naming:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your metric name should be clear about what it measures and where it comes from&lt;/li&gt;
&lt;li&gt;Put units in your metric names&lt;/li&gt;
&lt;li&gt;Use SI units without multipliers&lt;/li&gt;
&lt;li&gt;Ensure counter names end in &lt;code&gt;_total&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Types and Usage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prefer counters and histograms over gauges&lt;/li&gt;
&lt;li&gt;Prefer histograms over summaries&lt;/li&gt;
&lt;li&gt;Design your metrics so that you never want to have &amp;ldquo;per&amp;rdquo; in a metric name&lt;/li&gt;
&lt;li&gt;Track changes (like total water added and removed) rather than states (current water level)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="use-it"&gt;Use it!&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;promlint&lt;/code&gt; tool is great, and you should use it! It will catch some mistakes, and thoughtfully addressing its output will make you think about your metrics in a good way. It is also easy to set up, and you can run it as part of your unit tests.&lt;/p&gt;
&lt;p&gt;Along the way, please think deeply about the metric names and types you define and use, and prefer Counters and Histograms. Metrics are a very &amp;ldquo;sticky&amp;rdquo; part of a project, so getting them right early is a great way to avoid much bigger headaches later.&lt;/p&gt;</description></item><item><title>Highly reliable Go code - Don't ignore errors when you defer, use this pattern instead</title><link>https://triplepat.com/blog/2025/02/04/defererrors</link><pubDate>Tue, 04 Feb 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/02/04/defererrors</guid><description>&lt;p&gt;Continuing our series on highly reliable Go code, here is a nice pattern for handling errors when you &lt;code&gt;defer&lt;/code&gt;. Let&amp;rsquo;s start with a common problem. If you are building a Go service, you might have code that looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;s&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;http&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Server&lt;/span&gt;{ &lt;span style="color:#75715e"&gt;/* ...setup... */&lt;/span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// This returns an error we&amp;#39;re ignoring!&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;defer&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;s&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Shutdown&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;s&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;ListenAndServe&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="improving-httpserver-usage"&gt;Improving &lt;code&gt;http.Server&lt;/code&gt; usage&lt;/h2&gt;
&lt;p&gt;There&amp;rsquo;s a problem here: two functions return &lt;code&gt;error&lt;/code&gt;, and we&amp;rsquo;re ignoring both of them! We can do better.&lt;/p&gt;
&lt;p&gt;For &lt;code&gt;ListenAndServe&lt;/code&gt;, the fix is straightforward — we should use &lt;a href="https://triplepat.com/blog/2025/01/27/logonerror"&gt;&lt;code&gt;LogOnError&lt;/code&gt;&lt;/a&gt;. You can go to the linked post to learn more, but the key is that &lt;code&gt;LogOnError&lt;/code&gt; is a function that takes an error and logs it if it&amp;rsquo;s not nil. The deferred &lt;code&gt;Shutdown&lt;/code&gt; is trickier though. Here&amp;rsquo;s why: when using &lt;code&gt;defer&lt;/code&gt;, all arguments to the function are evaluated immediately, but the function call itself is delayed until the surrounding function returns. So this won&amp;rsquo;t work:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// Wrong! Shutdown runs immediately&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;defer&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;s&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Shutdown&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;Shutdown&lt;/code&gt; is called immediately, and only the &lt;code&gt;LogOnError&lt;/code&gt; call is deferred. That&amp;rsquo;s not what we want! We want both the shutdown and its error handling to happen when the function returns.&lt;/p&gt;
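&lt;p&gt;You can watch this ordering happen without a real server. In the sketch below, &lt;code&gt;shutdown&lt;/code&gt; and &lt;code&gt;logOnError&lt;/code&gt; are simplified stand-ins invented for this example (they are not the real &lt;code&gt;Shutdown&lt;/code&gt; or &lt;code&gt;LogOnError&lt;/code&gt;), and each step is appended to a slice so the order is visible. It also shows the most direct fix, which is to wrap both calls in a closure.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// events records the order in which things happen.
var events []string

// shutdown is a hypothetical stand-in for s.Shutdown(ctx).
func shutdown() error {
	events = append(events, "shutdown")
	return errors.New("shutdown failed")
}

// logOnError is a simplified stand-in for LogOnError.
func logOnError(err error) {
	if err != nil {
		events = append(events, "logged: "+err.Error())
	}
}

// wrong defers only the logging; shutdown runs when the defer statement executes.
func wrong() {
	defer logOnError(shutdown())
	events = append(events, "serving")
}

// right wraps both calls in a closure, so neither runs until the function returns.
func right() {
	defer func() { logOnError(shutdown()) }()
	events = append(events, "serving")
}

// run resets the event log, calls f, and returns the recorded order.
func run(f func()) []string {
	events = nil
	f()
	return events
}

func main() {
	fmt.Println(run(wrong)) // shutdown is recorded before serving
	fmt.Println(run(right)) // serving is recorded before shutdown
}
```

&lt;p&gt;The closure works, but writing it out at every call site gets noisy.&lt;/p&gt;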
&lt;p&gt;Here&amp;rsquo;s a cleaner solution using a new interface:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// Shutdowner needs a different name from the function below: a type and a&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// function cannot share one name in the same package.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;type&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Shutdowner&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;interface&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Shutdown&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;context&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Context&lt;/span&gt;) &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;ShutdownWithError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;context&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Context&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;s&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Shutdowner&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;s&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Shutdown&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now we can write:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;defer&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;ShutdownWithError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;s&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;s&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;ListenAndServe&lt;/span&gt;())
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This gives us exactly what we want: the shutdown happens when the function returns, and its error is logged. The trick here is that Go&amp;rsquo;s &lt;code&gt;*http.Server&lt;/code&gt; automatically implements our &lt;code&gt;Shutdowner&lt;/code&gt; interface, even though its authors never knew about it. This is one of Go&amp;rsquo;s powerful features: we can declare new interfaces that existing types automatically implement.&lt;/p&gt;
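&lt;p&gt;To try the idea without starting a real server, here is a small sketch with a stand-in type. Everything in it is hypothetical and chosen for illustration: &lt;code&gt;fakeServer&lt;/code&gt; plays the role of the server, &lt;code&gt;Shutdowner&lt;/code&gt; is the one-method interface, and &lt;code&gt;logOnError&lt;/code&gt; stores the message in &lt;code&gt;lastLogged&lt;/code&gt; instead of writing to a real logger.&lt;/p&gt;

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Shutdowner is a hypothetical one-method interface, declared after the fact.
type Shutdowner interface {
	Shutdown(ctx context.Context) error
}

// fakeServer is a made-up type that never mentions Shutdowner, yet satisfies
// it simply by having a method with the right signature.
type fakeServer struct{ name string }

func (f fakeServer) Shutdown(ctx context.Context) error {
	return errors.New(f.name + ": shutdown failed")
}

// lastLogged stands in for a real logger so the result can be inspected.
var lastLogged string

// logOnError is a simplified stand-in for LogOnError.
func logOnError(err error) {
	if err != nil {
		lastLogged = err.Error()
	}
}

// shutdownWithError calls Shutdown and reports any error it returns.
func shutdownWithError(ctx context.Context, s Shutdowner) {
	logOnError(s.Shutdown(ctx))
}

// run defers the helper, so both the shutdown and the logging happen at exit.
func run() string {
	lastLogged = ""
	func() {
		defer shutdownWithError(context.Background(), fakeServer{name: "fake"})
	}()
	return lastLogged
}

func main() {
	fmt.Println(run())
}
```

&lt;p&gt;The deferred call captures only the context and the server value, so nothing is shut down until the inner function returns. At that point the error from &lt;code&gt;Shutdown&lt;/code&gt; reaches the stand-in logger.&lt;/p&gt;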
&lt;h2 id="improving-database-transaction-usage"&gt;Improving Database Transaction Usage&lt;/h2&gt;
&lt;p&gt;This pattern is particularly useful for database transactions. Best practice is to always defer a call to &lt;code&gt;Rollback&lt;/code&gt;: if the transaction was already committed, the call changes nothing and just returns &lt;code&gt;sql.ErrTxDone&lt;/code&gt;; otherwise it aborts the transaction and prevents resource leaks. The typical pattern for transaction code is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;tx&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;db&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;BeginTx&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// This returns an error we&amp;#39;re ignoring!&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;defer&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;tx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Rollback&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// ... do database stuff using the transaction...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;tx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Commit&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;But &lt;code&gt;Rollback&lt;/code&gt; returns an error, and we&amp;rsquo;re ignoring it! We can apply the same pattern as before:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// Rollbacker is named differently from the function below so both can live in one package.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;type&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Rollbacker&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;interface&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Rollback&lt;/span&gt;() &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;RollbackWithError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;r&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Rollbacker&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// Ignore ErrTxDone: it means the transaction was already committed or rolled back&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;r&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Rollback&lt;/span&gt;(), &lt;span style="color:#a6e22e"&gt;sql&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;ErrTxDone&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This new function, &lt;code&gt;RollbackWithError&lt;/code&gt;, lets us write cleaner rollback code that no longer ignores errors:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;tx&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;db&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;BeginTx&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;ctx&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;defer&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;RollbackWithError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;tx&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// ... do database stuff using the transaction...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;tx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Commit&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="use-it"&gt;Use it!&lt;/h2&gt;
&lt;p&gt;This pattern does more than just make linters happy — it makes your code more reliable. You won&amp;rsquo;t silently ignore errors, but you also don&amp;rsquo;t have to change the flow of your code to get that benefit. One interface and one function, and you get the safe behavior you want. While you don&amp;rsquo;t need to handle every error, you should never be in a situation where errors occur without your knowledge. This pattern ensures you&amp;rsquo;ll know when things go wrong, even when the errors aren&amp;rsquo;t otherwise handled.&lt;/p&gt;
&lt;p&gt;If you liked this, I recommend also checking out &lt;a href="https://triplepat.com/blog/2025/01/27/logonerror"&gt;&lt;code&gt;LogOnError&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://triplepat.com/blog/2025/01/20/must/"&gt;&lt;code&gt;Must&lt;/code&gt;&lt;/a&gt;, which are related techniques that make it easier to handle errors safely and succinctly. All code in this blog post is &lt;a href="https://creativecommons.org/public-domain/cc0/"&gt;CC0&lt;/a&gt;, which means you can freely use it however you want.&lt;/p&gt;
&lt;p&gt;Happy coding!&lt;/p&gt;</description></item><item><title>Highly reliable Go code - Log on error</title><link>https://triplepat.com/blog/2025/01/27/logonerror</link><pubDate>Mon, 27 Jan 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/01/27/logonerror</guid><description>&lt;p&gt;If you are writing Go, then you spend a lot of time handling errors or ignoring errors. Ignoring errors is a code smell, but it is sometimes the right thing to do. This post describes a function I&amp;rsquo;ve found useful for handling less-important errors while keeping both the linter and logging systems happy. It&amp;rsquo;s almost too small for a blog post, but it&amp;rsquo;s so useful that I wanted to share it.&lt;/p&gt;
&lt;h2 id="the-problem"&gt;The Problem&lt;/h2&gt;
&lt;p&gt;A lot of Go functions return an error value along with a result. If you know that the error should always be nil, you can use &lt;a href="https://triplepat.com/blog/2025/01/20/must/"&gt;Must&lt;/a&gt; to crash when the error is not nil. Sometimes, though, ignoring the error is almost completely fine! This can happen, for example, when you call &lt;code&gt;.Close()&lt;/code&gt; on a connection — the close might fail or it might succeed, but either way you are done with the connection.&lt;/p&gt;
&lt;p&gt;When the linter tells you to pay attention to an unhandled error that you know isn&amp;rsquo;t important, you&amp;rsquo;re in a bit of a bind. Ignoring errors is a code smell, and so is disabling the linter. What can you do?&lt;/p&gt;
&lt;h2 id="the-evolution-of-a-solution"&gt;The Evolution of a Solution&lt;/h2&gt;
&lt;p&gt;An obvious solution is to log the error. This leads to dozens of lines like this throughout your codebase:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;slog&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;error closing connection&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;error&amp;#34;&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This works, but it quickly becomes repetitive, clutters your code, and is a nightmare to test. Let&amp;rsquo;s improve this step by step.&lt;/p&gt;
&lt;h3 id="step-1-basic-implementation"&gt;Step 1: Basic Implementation&lt;/h3&gt;
&lt;p&gt;First we make a simple helper function:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;slog&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;unhandled error&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;error&amp;#34;&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This initial version saves us from writing the same code over and over again and provides a central point for error handling. This is usually enough, but some APIs return &amp;ldquo;normal&amp;rdquo; errors that shouldn&amp;rsquo;t be logged. We can improve our function to handle these cases.&lt;/p&gt;
&lt;h3 id="step-2-adding-exception-handling"&gt;Step 2: Adding Exception Handling&lt;/h3&gt;
&lt;p&gt;Sometimes certain errors are expected and shouldn&amp;rsquo;t be logged. For example, getting &lt;code&gt;io.EOF&lt;/code&gt; from a connection isn&amp;rsquo;t really an error. We can improve our function to handle these cases:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;exceptions&lt;/span&gt; &lt;span style="color:#f92672"&gt;...&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;for&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;_&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;exception&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;range&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;exceptions&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;errors&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Is&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;exception&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;slog&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;unhandled error&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;error&amp;#34;&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now we can explicitly specify which errors we expect and want to ignore:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;rtx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;os&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Remove&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;filename&lt;/span&gt;), &lt;span style="color:#a6e22e"&gt;os&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;ErrNotExist&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This removes the file if it exists, and doesn&amp;rsquo;t log anything if it doesn&amp;rsquo;t. If the remove fails for some other reason, then a log message is produced.&lt;/p&gt;
&lt;h3 id="step-3-adding-stack-traces"&gt;Step 3: Adding Stack Traces&lt;/h3&gt;
&lt;p&gt;As a final improvement, we can make debugging easier by logging the file and line where &lt;code&gt;LogOnError&lt;/code&gt; was called:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;exceptions&lt;/span&gt; &lt;span style="color:#f92672"&gt;...&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;for&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;_&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;exception&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;range&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;exceptions&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;errors&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Is&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;exception&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// If we get here, we have an unhandled error.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;_&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;file&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;line&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;ok&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;runtime&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Caller&lt;/span&gt;(&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;ok&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;slog&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;unhandled error&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;error&amp;#34;&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(),
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;file&amp;#34;&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;file&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;line&amp;#34;&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;line&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; )
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; } &lt;span style="color:#66d9ef"&gt;else&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;slog&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;unhandled error (no debug information)&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;error&amp;#34;&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(),
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; )
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now every log entry names the file and line of the &lt;code&gt;LogOnError&lt;/code&gt; call that reported the error, making debugging much more straightforward.&lt;/p&gt;
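&lt;p&gt;One consequence of &lt;code&gt;runtime.Caller(1)&lt;/code&gt; is worth seeing in a small sketch: it reports the immediate caller, so when the logging function is reached through a wrapper, the location is inside the wrapper. The names &lt;code&gt;where&lt;/code&gt; and &lt;code&gt;viaHelper&lt;/code&gt; are hypothetical and exist only for this illustration.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// where reports the file and line of its immediate caller, the same lookup
// the final LogOnError performs with runtime.Caller(1).
func where() string {
	_, file, line, ok := runtime.Caller(1)
	if !ok {
		return "unknown"
	}
	return fmt.Sprintf("%s:%d", filepath.Base(file), line)
}

// viaHelper reaches where through one extra layer, so the reported location
// is the line inside viaHelper, not the line that called viaHelper.
func viaHelper() string {
	return where()
}

func main() {
	fmt.Println("direct call: ", where())
	fmt.Println("through helper:", viaHelper())
}
```

&lt;p&gt;The two lines printed differ. If you wrap the logging function in helpers of your own, one option is to pass a larger skip value to &lt;code&gt;runtime.Caller&lt;/code&gt; so the log still points at the code you care about.&lt;/p&gt;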
&lt;h2 id="real-world-examples"&gt;Real-World Examples&lt;/h2&gt;
&lt;p&gt;Here are some common scenarios where &lt;code&gt;LogOnError&lt;/code&gt; shines:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;HTTP Response Writing&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// Encoding JSON - errors here usually mean the client disconnected&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;rtx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;encoder&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Encode&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;response&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;File Operations&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// Closing files after we&amp;#39;re done with them&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;rtx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;file&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Close&lt;/span&gt;())
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// Removing temporary files - don&amp;#39;t worry if they&amp;#39;re already gone&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;rtx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;os&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Remove&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;tempFile&lt;/span&gt;), &lt;span style="color:#a6e22e"&gt;os&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;ErrNotExist&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Network Operations&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// Closing network connections&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;rtx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;conn&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Close&lt;/span&gt;())
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Database Operations&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// Rolling back database transactions&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;rtx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;LogOnError&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;tx&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Rollback&lt;/span&gt;())
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="benefits"&gt;Benefits&lt;/h2&gt;
&lt;p&gt;Using &lt;code&gt;LogOnError&lt;/code&gt; provides several advantages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cleaner Code&lt;/strong&gt;: Reduces boilerplate while maintaining proper error handling&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better Observability&lt;/strong&gt;: All unexpected errors are logged with their location&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Centralized Control&lt;/strong&gt;: Single point to modify error handling behavior&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Linter Friendly&lt;/strong&gt;: Satisfies the linter&amp;rsquo;s requirements for error handling&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible&lt;/strong&gt;: Easy to customize for different types of errors or logging needs&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="use-it"&gt;Use it!&lt;/h2&gt;
&lt;p&gt;This pattern has become an essential part of my Go toolkit. The beauty of this approach is that it&amp;rsquo;s simple enough to implement in minutes, yet powerful enough to significantly improve your code&amp;rsquo;s readability. Handling errors can be like &amp;ldquo;eating your vegetables&amp;rdquo;: it&amp;rsquo;s good for you, but sometimes you don&amp;rsquo;t want to. &lt;code&gt;LogOnError&lt;/code&gt; makes that handling much easier.&lt;/p&gt;
&lt;p&gt;All code in this post is available under a &lt;a href="https://creativecommons.org/public-domain/cc0/"&gt;CC0 license&lt;/a&gt;. You can use it in your own projects and modify it to fit your needs. No credit needed, but I&amp;rsquo;d love to &lt;a href="https://triplepat.com/contact/"&gt;hear from you&lt;/a&gt; if you find patterns like this useful. In my experience, reliable systems come from repeatedly applying small improvements like this until the code is so clear that all problems are obvious.&lt;/p&gt;
&lt;p&gt;If you found this helpful, you might also enjoy my post about &lt;a href="https://triplepat.com/blog/2025/01/20/must/"&gt;&lt;code&gt;Must&lt;/code&gt; and &lt;code&gt;ValueOrDie&lt;/code&gt;&lt;/a&gt; for easily handling unrecoverable errors. A common thread between these two is that when the mental &amp;ldquo;speed bump&amp;rdquo; of adding an &lt;code&gt;if&lt;/code&gt; check for errors is removed, Go code becomes easier to read, understand, and write. It doesn&amp;rsquo;t matter that the check is trivial: it still takes up brain space, and reclaiming that brain space is a good thing.&lt;/p&gt;
&lt;p&gt;This isn&amp;rsquo;t big and it isn&amp;rsquo;t rocket science, but it is a small improvement that can make a big difference when used consistently, and it saves you from writing a lot of boilerplate code. Use it in good health, and happy coding!&lt;/p&gt;</description></item><item><title>Highly reliable Go code - Must and ValueOrDie</title><link>https://triplepat.com/blog/2025/01/20/must</link><pubDate>Mon, 20 Jan 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/01/20/must</guid><description>&lt;p&gt;With &lt;code&gt;Must&lt;/code&gt; and &lt;code&gt;ValueOrDie&lt;/code&gt;, you can make error handling in Go a lot easier, clearer, and safer. Even better, this isn&amp;rsquo;t a new idea; it&amp;rsquo;s been around for a long time and is used in many places. You should use it too!&lt;/p&gt;
&lt;p&gt;When writing Go code, a lot of methods can return an &lt;a href="https://pkg.go.dev/builtin#error"&gt;error&lt;/a&gt; value. Best practices dictate
that you not ignore this error (or any other return value), but instead do one
thing if the function succeeded and another if it returned a non-nil error. Because of this, Go code
has sometimes been accused of just being &lt;code&gt;if err != nil { ... }&lt;/code&gt; over and over again.&lt;/p&gt;
&lt;p&gt;Sometimes, however, there is nothing that can be done when there is an error. Your program
can&amp;rsquo;t safely go on, and there&amp;rsquo;s no way of fixing the problem. Other times, the error is actually impossible, but the compiler
can&amp;rsquo;t prove that. For both cases, people have repeatedly come up with the same solution, and have even named it the same thing: &lt;code&gt;Must()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The first instance I ever saw was &lt;a href="https://pkg.go.dev/regexp#MustCompile"&gt;&lt;code&gt;MustCompile&lt;/code&gt; in the regexp library&lt;/a&gt;. The regexp
compiler returns a compiled regular expression, and an error in case
compilation fails. But sometimes you are compiling a static string! In that
case, if compilation ever succeeds it will always succeed, and if it fails your
code is wrong and the program should crash.&lt;/p&gt;
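&lt;p&gt;A minimal sketch of that static-string case follows. The pattern and the &lt;code&gt;looksLikeDate&lt;/code&gt; helper are made up for illustration; only &lt;code&gt;regexp.MustCompile&lt;/code&gt; and &lt;code&gt;MatchString&lt;/code&gt; come from the standard library.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
)

// datePattern is compiled once, at package initialization. The pattern is a
// static string, so a compile failure is a bug in this file, and MustCompile
// panics instead of returning an error nobody could handle.
var datePattern = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}$`)

// looksLikeDate (a hypothetical helper) reports whether s has the shape YYYY-MM-DD.
func looksLikeDate(s string) bool {
	return datePattern.MatchString(s)
}

func main() {
	fmt.Println(looksLikeDate("2025-01-20")) // true
	fmt.Println(looksLikeDate("January 20")) // false
}
```

&lt;p&gt;If the pattern had a typo, the program would crash at startup, long before any caller reached &lt;code&gt;looksLikeDate&lt;/code&gt;.&lt;/p&gt;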
&lt;h2 id="must-and-its-sibling-valueordie"&gt;&lt;code&gt;Must&lt;/code&gt; and its sibling &lt;code&gt;ValueOrDie&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;The regexp library authors made &lt;a href="https://pkg.go.dev/regexp#MustCompile"&gt;&lt;code&gt;regexp.MustCompile()&lt;/code&gt;&lt;/a&gt; for crashing when a regexp fails to compile. There are also other &lt;code&gt;MustX()&lt;/code&gt; functions sprinkled around the standard library and in other people&amp;rsquo;s code. The name was evocative, and ever since, people have proposed a generic &lt;code&gt;Must()&lt;/code&gt; function and written their own &lt;code&gt;Must()&lt;/code&gt; helpers. There are lots of versions with slight variations, but the simplest one is easiest to understand:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Must&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; panic(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Thanks to Go generics, we can add its twin:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;ValueOrDie&lt;/span&gt;[&lt;span style="color:#a6e22e"&gt;T&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;any&lt;/span&gt;](&lt;span style="color:#a6e22e"&gt;val&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;T&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;T&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; panic(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;val&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With these two functions, setting up your systems in &lt;code&gt;main()&lt;/code&gt; is a lot
easier! Do you need to successfully open a file?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;file&lt;/span&gt; = &lt;span style="color:#a6e22e"&gt;ValueOrDie&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;os&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Open&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;filename&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;does the job! Do you need to listen on a port to serve requests?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;listener&lt;/span&gt; &lt;span style="color:#f92672"&gt;:=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;ValueOrDie&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;net&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Listen&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;tcp&amp;#34;&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;address&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;works great! Do you need a string to unmarshal as JSON successfully?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;Must&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;json&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Unmarshal&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;scanner&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Bytes&lt;/span&gt;(), &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;&lt;span style="color:#a6e22e"&gt;received&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;to the rescue! In each of these cases, the error is fatal, and the program
should crash. The &lt;code&gt;Must()&lt;/code&gt; and &lt;code&gt;ValueOrDie()&lt;/code&gt; functions make this easy.&lt;/p&gt;
&lt;h2 id="functions-as-folklore"&gt;Functions as folklore&lt;/h2&gt;
&lt;p&gt;These functions are so useful that they keep getting written as an internal tool
in libraries. Most recently I found a &lt;code&gt;Must()&lt;/code&gt; in the &lt;a href="https://pkg.go.dev/github.com/google/uuid"&gt;uuid
library&lt;/a&gt;. It only works on UUIDs,
but it&amp;rsquo;s &lt;a href="https://github.com/google/uuid/blob/2d3c2a9cc518326daf99a383f07c4d3c44317e4d/uuid.go#L210"&gt;defined exactly like
ValueOrDie&lt;/a&gt;
and used in many important and popular library functions (e.g.
&lt;a href="https://github.com/google/uuid/blob/2d3c2a9cc518326daf99a383f07c4d3c44317e4d/version4.go#L13"&gt;&lt;code&gt;New()&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Because this pair of little functions is repeatedly and similarly defined in many places, I would
encourage you to put them in a standard place in your own projects and use them liberally. They&amp;rsquo;ve passed the test of time! They also
really help when targeting high code coverage, because useless error code paths are
covered by the testing of &lt;code&gt;Must()&lt;/code&gt; rather than your tests being required to
invoke the error condition. You know for sure that if your program got to a
particular line, then the previous commands must have succeeded.&lt;/p&gt;
&lt;p&gt;When I&amp;rsquo;m feeling whimsical, I think of these functions as &lt;a href="https://en.wikipedia.org/wiki/Anthropic_principle"&gt;the anthropic
principle&lt;/a&gt; applied to code.
They mean that your program ran successfully at the end because it got to the end, which means it ran successfully! Whimsy aside, this
also means that your program is abiding by crash-on-error principles in some
important parts, and crash-on-error programs have a history of being better in
many contexts because they don&amp;rsquo;t try to stumble forward when things are
broken; they immediately crash. &lt;em&gt;&lt;a href="https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/"&gt;The Pragmatic Programmer&lt;/a&gt;&lt;/em&gt; agrees! Tip 38 is
&amp;ldquo;&lt;a href="https://pragprog.com/tips/"&gt;Crash Early. A dead program normally does a lot less damage than a crippled
one.&lt;/a&gt;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I highly recommend using a linter to make sure you never ignore a returned
error, and then using &lt;code&gt;Must&lt;/code&gt; and &lt;code&gt;ValueOrDie&lt;/code&gt; throughout for errors that
are impossible or unrecoverable. You save yourself the trouble of testing
those error conditions, and people reading your code don&amp;rsquo;t have to worry that
something might not have worked. It also follows the next Pragmatic Programmer
tip! Tip 39: &amp;ldquo;&lt;a href="https://pragprog.com/tips/"&gt;Use Assertions to Prevent the Impossible. If it can’t happen, use assertions to ensure that it won’t. Assertions validate your assumptions. Use them to protect your code from an uncertain world.&lt;/a&gt;&amp;rdquo;&lt;/p&gt;
&lt;p&gt;When advice from 20 years ago is being actively used throughout many modern codebases and being constantly re-invented based on need, I think the authors were really on to something. If you want your code to be reliable, make sure errors don&amp;rsquo;t happen. The only way to be sure they don&amp;rsquo;t is to either handle the errors (which makes them not erroneous, but instead an understood condition) or to crash (which ensures that your program will not erroneously try to make progress). Either way, you need to check the success condition, so turn on that linter and get to it!&lt;/p&gt;
&lt;h2 id="implications-for-the-larger-system"&gt;Implications for the larger system&lt;/h2&gt;
&lt;p&gt;The only real requirement after this is a system that notices the crashes and
&amp;ldquo;does the right thing&amp;rdquo;. In all cases, the number of crashes should be tracked,
but sometimes the right thing is to allow the crash to take a larger system out
of production, and other times the right thing to do is to restart the crashed
program (&lt;a href="https://www.youtube.com/watch?v=DPqdyoTpyEs"&gt;&amp;ldquo;Have you tried turning it off and on
again?&amp;rdquo;&lt;/a&gt; at scale).&lt;/p&gt;
&lt;p&gt;Either way, you can put your system on a better-designed and better-tested path
by liberally using &lt;code&gt;Must(...)&lt;/code&gt; and &lt;code&gt;ValueOrDie(...)&lt;/code&gt;. These functions help assert that neither unhandleable errors nor inconceivable conditions are occurring.&lt;/p&gt;
&lt;h2 id="implementing-must-and-valueordie"&gt;Implementing &lt;code&gt;Must&lt;/code&gt; and &lt;code&gt;ValueOrDie&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;By this point, I think we can agree that these functions are &amp;ldquo;folklore&amp;rdquo;: implemented in many places, often with tiny local variations, but all basically the same. When I wrote open-source Go
code, I put these functions in a library called
&lt;a href="https://github.com/m-lab/go/blob/main/rtx/rtx.go#L32"&gt;&lt;code&gt;rtx&lt;/code&gt;&lt;/a&gt; (&amp;ldquo;run-time
extensions&amp;rdquo;). However, I now think that version is a bit overwrought. It tries too hard to be pretty and clever with its output.&lt;/p&gt;
&lt;p&gt;The one
I currently use in my projects is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-go" data-lang="go"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;package&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;rtx&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;import&lt;/span&gt; (
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;os&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;log/slog&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;runtime/debug&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// We allow injection for testing.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;var&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;osExit&lt;/span&gt; = &lt;span style="color:#a6e22e"&gt;os&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Exit&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;die&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;debug&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;PrintStack&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;slog&lt;/span&gt;.&lt;span style="color:#a6e22e"&gt;Error&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;fatal error&amp;#34;&lt;/span&gt;, &lt;span style="color:#e6db74"&gt;&amp;#34;error&amp;#34;&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;osExit&lt;/span&gt;(&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;Must&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;die&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;func&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;ValueOrDie&lt;/span&gt;[&lt;span style="color:#a6e22e"&gt;T&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;any&lt;/span&gt;](&lt;span style="color:#a6e22e"&gt;obj&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;T&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;error&lt;/span&gt;) &lt;span style="color:#a6e22e"&gt;T&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;err&lt;/span&gt; &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;die&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;err&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;obj&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I prefer this version for two reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It calls &lt;code&gt;os.Exit(1)&lt;/code&gt;, which is unrecoverable, rather than calling &lt;code&gt;panic()&lt;/code&gt;, which is recoverable. These functions are for unrecoverable situations. By using an unrecoverable function we remove the temptation to try to use this for fancy flow control.&lt;/li&gt;
&lt;li&gt;It prints the stack trace and a simple &lt;code&gt;slog&lt;/code&gt; message. I&amp;rsquo;m a big fan of &lt;a href="https://go.dev/blog/slog"&gt;structured logging&lt;/a&gt; and this does that, but it doesn&amp;rsquo;t get too precious about it by somehow trying to convert a stack trace into a structured log. Also, because we exit instead of panicking, it&amp;rsquo;s nice to print the stack trace because we don&amp;rsquo;t get it by default.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You might have slightly different preferences and that&amp;rsquo;s fine! Each teller tells their own version of
a folklore story, and that&amp;rsquo;s part of the charm.&lt;/p&gt;
&lt;h2 id="when-to-use-must-and-valueordie"&gt;When to use &lt;code&gt;Must&lt;/code&gt; and &lt;code&gt;ValueOrDie&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;✅ &lt;strong&gt;Good Use Cases:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Program initialization and setup
&lt;ul&gt;
&lt;li&gt;Loading configuration files&lt;/li&gt;
&lt;li&gt;Opening required system resources&lt;/li&gt;
&lt;li&gt;Compiling static regular expressions&lt;/li&gt;
&lt;li&gt;Parsing known-valid static data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Test setup where failure means the test itself is broken&lt;/li&gt;
&lt;li&gt;Situations where a returned error cannot occur&lt;/li&gt;
&lt;li&gt;Situations where recovery is impossible or meaningless&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;❌ &lt;strong&gt;Avoid Using For:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Network operations that can be retried&lt;/li&gt;
&lt;li&gt;Any runtime data that could legitimately fail&lt;/li&gt;
&lt;li&gt;Any time handling the error is desirable&lt;/li&gt;
&lt;li&gt;Errors that don&amp;rsquo;t matter&lt;/li&gt;
&lt;li&gt;Situations where partial failure is acceptable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key question to ask yourself: &amp;ldquo;If this fails, does it make sense to continue doing anything?&amp;rdquo; If not, &lt;code&gt;Must&lt;/code&gt; and &lt;code&gt;ValueOrDie&lt;/code&gt; are appropriate.&lt;/p&gt;
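&lt;p&gt;To make the contrast concrete, here is a small sketch under made-up assumptions: &lt;code&gt;defaultPort&lt;/code&gt; and &lt;code&gt;portFromUser&lt;/code&gt; are hypothetical names, and the simple panicking &lt;code&gt;ValueOrDie&lt;/code&gt; from earlier is repeated so the example stands alone. Static data goes through &lt;code&gt;ValueOrDie&lt;/code&gt;; data from the outside world gets a real error path.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
)

func ValueOrDie[T any](val T, err error) T {
	if err != nil {
		panic(err)
	}
	return val
}

// defaultPort is parsed from a static string written by the programmer.
// If this fails, the code is wrong, so crashing at startup is appropriate.
var defaultPort = ValueOrDie(strconv.Atoi("8080"))

// portFromUser parses runtime input that can legitimately be wrong, so the
// error is returned to the caller instead of crashing the program.
func portFromUser(input string) (int, error) {
	port, err := strconv.Atoi(input)
	if err != nil {
		return 0, fmt.Errorf("invalid port %q: %w", input, err)
	}
	return port, nil
}

func main() {
	fmt.Println("default port:", defaultPort)
	if _, err := portFromUser("eighty"); err != nil {
		fmt.Println("recoverable:", err)
	}
}
```

&lt;p&gt;The same call, &lt;code&gt;strconv.Atoi&lt;/code&gt;, lands on both sides of the line. What decides it is where the input comes from, not which function returned the error.&lt;/p&gt;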
&lt;h2 id="use-it"&gt;Use it!&lt;/h2&gt;
&lt;p&gt;I hereby place the code in this blog post in the public domain
(&lt;a href="https://creativecommons.org/public-domain/cc0/"&gt;CC:0&lt;/a&gt;). Please use it for
whatever you want in good health. You can even claim it as your own! It&amp;rsquo;s been
written and rewritten so often that at this point it&amp;rsquo;s folklore, and
folklore has no owner. Also, if you can figure out where and how to put a
version of it somewhere in the Go stdlib, please do that, so people don&amp;rsquo;t have
to keep rewriting it, and we can finally settle on the one true implementation
;)&lt;/p&gt;
&lt;p&gt;Happy coding!&lt;/p&gt;</description></item><item><title>Runtime dependencies</title><link>https://triplepat.com/blog/2025/01/09/dependencies</link><pubDate>Thu, 09 Jan 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/01/09/dependencies</guid><description>&lt;p&gt;We run a SaaS business for software reliability, which means we want others to
depend on us. In order for others to depend on us, we need a demonstrated
&lt;a href="https://status.triplepat.com"&gt;record of uptime and reliability&lt;/a&gt;, and we need
to be clear about what we depend on at runtime to deliver that service.&lt;/p&gt;
&lt;p&gt;In our &lt;a href="https://triplepat.com/blog/2025/01/03/keep-it-boring"&gt;keep it boring&lt;/a&gt;
post, we listed our dependencies. Some of those were services, and some were
libraries. For the services, some we depend on to work at least once a month,
and some we depend on to work basically all the time. As we get closer to
launch it&amp;rsquo;s a good time to revisit our dependencies and be explicit about them
and see if we are happy with the current state of affairs.&lt;/p&gt;
&lt;h2 id="must-work-all-the-time-at-least-one-cloud-provider"&gt;Must work all the time: at least one cloud provider&lt;/h2&gt;
&lt;p&gt;If all of our cloud providers go down in multiple regions simultaneously
(that&amp;rsquo;s 5 machines spread across &lt;a href="https://aws.amazon.com"&gt;AWS&lt;/a&gt;,
&lt;a href="https://cloud.google.com"&gt;GCP&lt;/a&gt;, and tiny EU-based &lt;a href="https://tilaa.com"&gt;TILAA&lt;/a&gt;),
Triple Pat is down and we can&amp;rsquo;t do anything about that. That said, it seems
vanishingly unlikely that they would all go down simultaneously, and our
cockroach-like strategy of having each server be a master DB means that we can
survive as long as at least one cloud provider is up.&lt;/p&gt;
&lt;p&gt;The critical thing here is that our dependency on cloud providers is an OR,
not an AND. If one cloud provider goes down, we are still up. If two of them
go down, we are still up. If four regions across three providers go down, we
are still up. We need at least one region of at least one cloud provider to be
up.&lt;/p&gt;
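&lt;p&gt;A back-of-the-envelope sketch of why OR beats AND, under two loud assumptions: provider failures are independent (real outages can be correlated), and the 99% availability figures are invented for illustration, not measurements of any provider.&lt;/p&gt;

```go
package main

import "fmt"

// availabilityOR returns the chance that at least one provider is up,
// assuming independent failures. up holds each provider's availability.
func availabilityOR(up []float64) float64 {
	allDown := 1.0
	for _, u := range up {
		allDown *= 1 - u
	}
	return 1 - allDown
}

// availabilityAND returns the chance that every provider is up at once,
// which is what a design that needs all of them would depend on.
func availabilityAND(up []float64) float64 {
	allUp := 1.0
	for _, u := range up {
		allUp *= u
	}
	return allUp
}

func main() {
	// Hypothetical per-provider availabilities, not measured values.
	up := []float64{0.99, 0.99, 0.99}
	fmt.Printf("OR:  %.6f\n", availabilityOR(up))
	fmt.Printf("AND: %.6f\n", availabilityAND(up))
}
```

&lt;p&gt;With these made-up numbers, the OR design comes out around 0.999999 while the AND design drops to about 0.97. Adding providers helps one design and hurts the other.&lt;/p&gt;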
&lt;h2 id="must-work-almost-all-the-time-at-least-one-dns-provider"&gt;Must work almost all the time: at least one DNS provider&lt;/h2&gt;
&lt;p&gt;DNS is scary because &lt;a href="https://dnshaiku.com/"&gt;it&amp;rsquo;s always DNS&lt;/a&gt; and DNS is &lt;a href="https://dependency.itwasdns.net/"&gt;a
critical dependency of the Internet&lt;/a&gt;. We use
&lt;a href="https://godaddy.com"&gt;Godaddy&lt;/a&gt; as our registrar and our provider for
&lt;a href="https://triplepat.com"&gt;triplepat.com&lt;/a&gt; and it&amp;rsquo;s competitor
&lt;a href="https://cloudflare.com"&gt;CloudFlare&lt;/a&gt; as our registrar and provider for
&lt;a href="https://triplepat.net"&gt;triplepat.net&lt;/a&gt;. Our phone apps will try names from both
domains before giving up, and so will all check-ins that follow best practices.
If DNS goes down then every system that does not have a cached DNS record for
our systems can no longer contact our systems, but users can still contact our
systems as long as at least one of our DNS providers is up.&lt;/p&gt;
&lt;p&gt;We can survive a few minutes of DNS downtime every now and then thanks to DNS
caching. We have two independent names using two different registrars and
providers, so it would have to be a pretty major disaster to take both of them
down for more than a few minutes.&lt;/p&gt;
&lt;h2 id="must-work-pretty-well-tailscale"&gt;Must work pretty well: Tailscale&lt;/h2&gt;
&lt;p&gt;We use &lt;a href="https://tailscale.com"&gt;Tailscale&lt;/a&gt; for our internal network to allow our
DBs to sync their records. Tailscale has a good design where even if the
Tailscale servers go down for a bit, it is highly likely that our internal
traffic will still flow. This means that we can survive periods of Tailscale
downtime without worry. Their robustness to their own failure is part of the
reason we use them! If they went down for days we would be in trouble, but we
don&amp;rsquo;t expect that to happen. If Tailscale goes down for an extended period, we
have a manual failover process that we can use to route around their service.&lt;/p&gt;
&lt;h2 id="must-work-at-least-once-every-3-months-letsencrypt"&gt;Must work at least once every 3 months: LetsEncrypt&lt;/h2&gt;
&lt;p&gt;We use &lt;a href="https://letsencrypt.org"&gt;LetsEncrypt&lt;/a&gt; to get SSL certificates for our
domains. We renew our certificates regularly. If we don&amp;rsquo;t have a valid
certificate, we can&amp;rsquo;t serve HTTPS traffic. This is a critical dependency for
us. That said, the certificates are valid for 3 months, and we renew them at
least weekly. So we can survive a few months of LetsEncrypt downtime. If
LetsEncrypt disappears, we have months to fix it before our users notice a
problem.&lt;/p&gt;
&lt;h2 id="thats-all"&gt;That&amp;rsquo;s all!&lt;/h2&gt;
&lt;p&gt;In steady state, those are all of our runtime dependencies for check-ins. As a
company, we rely on more services that don&amp;rsquo;t directly affect users&amp;rsquo; check-ins:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://play.google.com"&gt;Play Store&lt;/a&gt; and &lt;a href="https://apps.apple.com"&gt;App
Store&lt;/a&gt; and their subscription systems
allow users to download and use the app and for us to accept money,
which is important for our long-term existence, but not important
on a per-check-in basis. If those are down for less than a week,
users shouldn&amp;rsquo;t notice. If they are down for more than a week,
it will be in the news and everyone will notice.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hub.docker.com"&gt;Dockerhub&lt;/a&gt; stores our images and
&lt;a href="https://github.com"&gt;Github&lt;/a&gt; stores our code and runs our CI/CD, but them
going down impedes our development process not our runtime systems.&lt;/li&gt;
&lt;li&gt;We use Google for &lt;a href="https://drive.google.com"&gt;Drive&lt;/a&gt;,
&lt;a href="https://docs.google.com"&gt;Docs&lt;/a&gt;, and &lt;a href="https://mail.google.com"&gt;Gmail&lt;/a&gt;, but
those going down inconveniences us but won&amp;rsquo;t break our service.&lt;/li&gt;
&lt;li&gt;We (of course!) monitor all of these services and have alerting for them, so
we also rely on &lt;a href="https://grafana.com/cloud"&gt;Grafana Cloud&lt;/a&gt; to alert us when
they go down. If Grafana Cloud goes down for a bit, our systems still work
for our users.&lt;/li&gt;
&lt;li&gt;We use &lt;a href="https://betterstack.com"&gt;Better Stack&lt;/a&gt; for our monitoring our uptime,
but if they go down for a bit our systems still work for our
users. Ironically, &lt;a href="https://betterstack.com"&gt;Better Stack&lt;/a&gt; was
having an
&lt;a href="https://status.betterstack.com/incident/493204?mp=true"&gt;outage&lt;/a&gt; at
the time of publication, so our uptime proof was down, a nice example
of the intricacies of caring about reliability!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The bottom line is that for our users (you!), as long as at least one cloud
provider is working and at least one DNS provider is working, then our service,
our app, and your check-ins should all work just fine. Anything else being
down might inconvenience us, but you should be fine.&lt;/p&gt;</description></item><item><title>Keep it Boring</title><link>https://triplepat.com/blog/2025/01/03/keep-it-boring</link><pubDate>Fri, 03 Jan 2025 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2025/01/03/keep-it-boring</guid><description>&lt;p&gt;We are selling a service to people who care about reliability, and people who know reliability
know that keeping things simple is one of the best steps on the path to keeping things reliable.&lt;/p&gt;
&lt;p&gt;So here&amp;rsquo;s what we use!&lt;/p&gt;
&lt;p&gt;We use &lt;a href="https://boringtechnology.club/"&gt;default, boring technology&lt;/a&gt; whenever
possible. Every time we are clever, we should know the reason why. We use our cleverness on
the core of the service (keeping the check-in service reliable using master-master DBs and CRDTs),
and we keep the rest simple.&lt;/p&gt;
&lt;p&gt;This means we use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sqlite.org/"&gt;SQLite&lt;/a&gt; for our database.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tailscale.com/"&gt;Tailscale&lt;/a&gt; for internal networking.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://letsencrypt.org/"&gt;Letsencrypt&lt;/a&gt; for SSL/TLS certificates.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cert-manager.io/"&gt;cert-manager&lt;/a&gt; to manage them on each host.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nginx.org/"&gt;nginx&lt;/a&gt; for our public web server and reverse proxy.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docker.com/"&gt;Docker&lt;/a&gt; for containerization.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hub.docker.com/"&gt;Docker Hub&lt;/a&gt; to hold our container images.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; to generate our metrics.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt; for our monitoring dashboards.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/"&gt;Github&lt;/a&gt; for our source code management.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/features/actions"&gt;Github Actions&lt;/a&gt; for CI/CD.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://golang.org/"&gt;Go&lt;/a&gt; for our server programming language.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kotlinlang.org/"&gt;Kotlin&lt;/a&gt; and &lt;a href="https://swift.org/"&gt;Swift&lt;/a&gt; for our mobile programming languages.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/"&gt;AWS&lt;/a&gt; and &lt;a href="https://cloud.google.com/"&gt;GCP&lt;/a&gt; for our cloud providers, and only vanilla EC2/GCE machines to avoid lock-in&lt;/li&gt;
&lt;li&gt;&lt;a href="https://godaddy.com/"&gt;Godaddy&lt;/a&gt; and &lt;a href="https://cloudflare.com"&gt;CloudFlare&lt;/a&gt; for our DNS providers&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gohugo.io"&gt;Hugo&lt;/a&gt; for our static website generator&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We also use every available linter and formatter to make sure our code, configs,
and output are internally clean and consistent and align with the expectations
of the outside world.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s been a great experience so far, and we&amp;rsquo;re excited to see how it goes.&lt;/p&gt;</description></item><item><title>How the Triple Pat website works</title><link>https://triplepat.com/blog/2024/10/17/how-the-website-works</link><pubDate>Thu, 17 Oct 2024 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2024/10/17/how-the-website-works</guid><description>&lt;p&gt;The website is a static website, built with &lt;del&gt;&lt;a href="https://11ty.dev/"&gt;11ty&lt;/a&gt;&lt;/del&gt; &lt;a href="https://gohugo.io/"&gt;Hugo&lt;/a&gt;, and hosted in Google Cloud. Also running on this server is an instance of the Triple Pat check-in server, and the main instance of the Triple Pat user service. These three systems, as well as the monitoring system, are all behind an &lt;a href="https://nginx.com/"&gt;nginx&lt;/a&gt; web server which takes care of routing requests appropriately.&lt;/p&gt;
&lt;p&gt;When building a website, and especially a blog, there&amp;rsquo;s a real question: do I want to have comments? People love comments (mostly), but it&amp;rsquo;s hard to add &amp;ldquo;just a little dynamism&amp;rdquo; to a website without opening up a huge can of worms. We could have used &lt;a href="https://disqus.com/"&gt;Disqus&lt;/a&gt; (which is a cool service! this is not anti-Disqus, in fact they are the only recommended viable alternative to what we did!), but instead we did something a little funky: we made the comments section be &lt;a href="https://bsky.social"&gt;Bluesky&lt;/a&gt; threads! Thanks to &lt;a href="http://aendra.com"&gt;Aendra Rininsland&lt;/a&gt;&amp;rsquo;s &lt;a href="https://github.com/aendra-rininsland/blue-comments"&gt;blue-comments&lt;/a&gt; library, we can add comments to our website with minimal fuss, and we get notified of new comments as if we were chatting on Bluesky, because we are!&lt;/p&gt;
&lt;p&gt;I think it&amp;rsquo;s a really nice system, because we can have the comments section be 100% JavaScript-powered, and then Bluesky takes care of user identity, moderation, and all of the challenges that are implicit in social apps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If we start getting bad actors in the comments, Bluesky&amp;rsquo;s well-functioning block system means that those comments can be easily removed from view.&lt;/li&gt;
&lt;li&gt;We won&amp;rsquo;t miss a comment, because every comment is a social media notification.&lt;/li&gt;
&lt;li&gt;People are in control of their own words. If they want to delete their comments they can do that on Bluesky and the comment will disappear from the website.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We use a social website to empower the social act of commenting, and they then take care of letting people be in control of their own words and identities. We take care of the thing being talked about. It&amp;rsquo;s a good separation of concerns.&lt;/p&gt;
&lt;p&gt;The only downside is that we have to have the blog conversations a little more publicly (Bluesky threads are a little more publicized than a chat in the comments on a blog, but not much more), but the shift in context is not too radical, and I think it&amp;rsquo;s a good tradeoff for a small blog like this.&lt;/p&gt;</description></item><item><title>Building Triple Pat</title><link>https://triplepat.com/blog/2024/10/16/building-triple-pat</link><pubDate>Wed, 16 Oct 2024 00:00:00 +0000</pubDate><author>peter@triplepat.com (Peter Boothe)</author><guid>https://triplepat.com/blog/2024/10/16/building-triple-pat</guid><description>&lt;p&gt;To build a SaaS for software reliability, we need to make sure we are building a system that is, itself, very reliable. Tony Hoare famously &lt;a href="https://dl.acm.org/doi/pdf/10.1145/1283920.1283936"&gt;said&lt;/a&gt; &lt;q&gt;there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.&lt;/q&gt;&lt;/p&gt;
&lt;p&gt;To make sure we aren&amp;rsquo;t adding to the problem, we are repeatedly endeavoring to make the system as simple as possible. Along with simplicity, we want to minimize cleverness. Although it feels strange to mix Harry Potter and the ACM Turing Award, the other principle we hew to when constructing this system is &lt;q&gt;Never trust anything that can think for itself if you can&amp;rsquo;t see where it keeps its brain.&lt;/q&gt; So we want to make our systems as dumb as possible, and when they need to be smart, we want to be extra-clear about where the brain is.&lt;/p&gt;
&lt;p&gt;Our check-in system is simplicity itself. It is a service that tracks (UUID, timestamp) pairs, and the server is the only system allowed to say what &amp;ldquo;now&amp;rdquo; is. The internal database maintains the invariant that the time is never allowed to decrease, which means that the (UUID, timestamp) pairs form a &lt;a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type"&gt;CRDT&lt;/a&gt;. That in turn means our databases can safely operate with every independent node as a master, and we are guaranteed eventual consistency.&lt;/p&gt;
&lt;p&gt;The only &amp;ldquo;brain&amp;rdquo; of note is the phone app. It repeatedly polls Triple Pat servers (any server in our geographically distributed set of servers) for the last check-in time of each of a user&amp;rsquo;s UUIDs. If the time ever gets too old (configurable in the app), then the app will display an alert. If the Triple Pat servers are down, then the last check-in time will not be updated, so the user will be notified &amp;mdash; our alerting system fails noisily.&lt;/p&gt;
&lt;p&gt;By keeping the database conceptually simple (although master-master mode is always a little complex) and by putting all the brains in the phone app, we can guarantee that if there are no alerts from Triple Pat, then the last check-in time is within acceptable bounds. If the alarm is firing, then either Triple Pat is broken or the user&amp;rsquo;s service is down. Triple Pat continues to exist and have customers only if we make sure that, when the alarm fires, it is almost always because the user&amp;rsquo;s service is down.&lt;/p&gt;
&lt;p&gt;All of this is conceptually simple, but it requires a relatively broad knowledge base to actually build. We have two servers built in Go (one centralized system for user identity services, and one distributed system that just maintains the (UUID, timestamp) pairs), a website built with 11ty, phone apps built with Kotlin and Swift, and a complicated deployment strategy with servers spread across lots of providers. Each piece is simple and has a well-defined function, which is the only reason it remains tractable.&lt;/p&gt;</description></item></channel></rss>