<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-US" xmlns="http://www.w3.org/2005/Atom">
  <title>Fragments — brandur.org</title>
  <id>tag:brandur.org,2013:/fragments</id>
  <updated>2026-04-12T11:41:03-05:00</updated>
  <link rel="self" type="application/atom+xml" href="https://brandur.org/fragments.atom"></link>
  <link rel="alternate" type="text/html" href="https://brandur.org"></link>
  <entry>
    <title>Caveman</title>
    <summary>In 1980, Michael Crichton&amp;rsquo;s characters in &lt;em&gt;Congo&lt;/em&gt; spoke like cavemen to save satellite bandwidth. It was absurd. Ridiculous! Forty-five years later, we&amp;rsquo;re doing the same thing with LLMs to save tokens.</summary>
    <content type="html"><![CDATA[<p>An excerpt from Michael Crichton&rsquo;s <a href="https://en.wikipedia.org/wiki/Congo_(novel)">Congo (1980)</a>:</p>

<blockquote>
<p>“I don&rsquo;t understand,&rdquo; Elliot said. Ross explained that the &ldquo;M&rdquo; meant that there was more message, and he had to press the transmit button again. He pushed the button several times before he got the message, which in its entirety read:</p>

<blockquote>
<p>REVUWD ORGNL TAPE HUSTN NU FINDNG RE AURL SIGNL INFO-COMPUTR ANLYSS COMPLTE THNK ITS LNGWGE.</p>
</blockquote>

<p>Elliot found he could read the compressed shortline language by speaking it aloud: &ldquo;Reviewed original tape Houston, new finding regarding aural signal information, computer analysis complete think it&rsquo;s language.&rdquo; He frowned. &ldquo;Language?”</p>
</blockquote>

<p>Crichton was a gear guy. The story&rsquo;s protagonists took high tech satellite uplinks into the field, allowing transmission back to HQ, but due to the extreme expense of satellite bandwidth, they had to read messages in shorthand like, &ldquo;REVUWD ORGNL TAPE HUSTN NU FINDNG&rdquo;.</p>

<p>I always found it ridiculous. Although these words have had their vowels removed, they&rsquo;re still uniquely intelligible in the English language. It&rsquo;d be trivial to write a short algorithm that&rsquo;d use a dictionary to expand the message back to uncompressed English on the receiving end. Or better yet, stop with the vowel thing and use a standard compression algorithm <sup id="footnote-1-source"><a href="#footnote-1">1</a></sup>. You&rsquo;d get better results.</p>
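<p>As a sketch of that dictionary idea (a toy, and one that assumes strict vowel removal rather than the book&rsquo;s ad hoc spellings like &ldquo;REVUWD&rdquo;), the receiving end could index a wordlist by consonant skeleton and expand messages on arrival:</p>

<pre><code class="language-go">package main

import (
    &quot;fmt&quot;
    &quot;strings&quot;
)

// devowel strips vowels from a word, approximating the
// &quot;compression&quot; scheme used in the novel.
func devowel(word string) string {
    return strings.Map(func(r rune) rune {
        if strings.ContainsRune(&quot;AEIOU&quot;, r) {
            return -1
        }
        return r
    }, strings.ToUpper(word))
}

// expand looks up each devoweled token in an index built from a
// dictionary. Where two dictionary words share a skeleton, the
// last one wins; a real implementation would need context.
func expand(message string, dictionary []string) string {
    index := make(map[string]string)
    for _, word := range dictionary {
        index[devowel(word)] = strings.ToUpper(word)
    }

    tokens := strings.Fields(message)
    for i, token := range tokens {
        if full, ok := index[token]; ok {
            tokens[i] = full
        }
    }
    return strings.Join(tokens, &quot; &quot;)
}

func main() {
    dictionary := []string{&quot;houston&quot;, &quot;original&quot;, &quot;reviewed&quot;, &quot;tape&quot;}
    fmt.Println(expand(&quot;RVWD RGNL TP HSTN&quot;, dictionary))
    // Prints: REVIEWED ORIGINAL TAPE HOUSTON
}
</code></pre>

<p>The catch is collisions: &ldquo;BRD&rdquo; could be &ldquo;BIRD&rdquo;, &ldquo;BREAD&rdquo;, or &ldquo;BOARD&rdquo;, which is part of why a general-purpose compressor beats the scheme.</p>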

<hr />

<p>Yesterday, I came across <a href="https://github.com/JuliusBrussee/caveman">Caveman</a>. Its job is to save tokens in Claude by having the LLM speak like a caveman, removing filler words and other niceties that make up a more fluently legible human language.</p>

<p>Before:</p>

<blockquote>
<p>&ldquo;Sure! I&rsquo;d be happy to help you with that. The issue you&rsquo;re experiencing is most likely caused by your authentication middleware not properly validating the token expiry. Let me take a look and suggest a fix.&rdquo;</p>
</blockquote>

<p>After:</p>

<blockquote>
<p>&ldquo;Bug in auth middleware. Token expiry check use <code>&lt;</code> not <code>&lt;=</code>. Fix:&rdquo;</p>
</blockquote>

<p>Crichton would&rsquo;ve loved it. 45 years later we&rsquo;ve come full circle, are back to speaking like cavemen again, and as an at-least-somewhat legitimate technical workaround. I don&rsquo;t know what I thought I knew anymore.</p>


]]></content>
    <published>2026-04-12T11:41:03-05:00</published>
    <updated>2026-04-12T11:41:03-05:00</updated>
    <link href="https://brandur.org/fragments/caveman"></link>
    <id>tag:brandur.org,2026-04-12:fragments/caveman</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>&#34;Somewhere&#34; (2010) review</title>
    <summary>Short review of Sofia Coppola&amp;rsquo;s 2010 movie &lt;em&gt;Somewhere&lt;/em&gt;. Unless I miss my mark, it&amp;rsquo;s the same movie as &lt;em&gt;Lost in Translation&lt;/em&gt;?</summary>
    <content type="html"><![CDATA[<p>I&rsquo;ve often cited Sofia Coppola&rsquo;s <em>Lost in Translation</em> (2003) as one of my favorite movies. I&rsquo;d never dug much into Coppola&rsquo;s other work, so imagine my delight to discover that she&rsquo;s made another movie, <em>Somewhere</em> (2010) with a similar premise.</p>

<p>I excitedly got to watching it, but was ultimately disappointed. There&rsquo;s room for two movies to have similar premises, but <em>Somewhere</em> takes that to another level. It&rsquo;s functionally the same film.</p>

<p>The macro/themes are the same &ndash; disengaged, burned-out actor stays long-term at a hotel. A young woman comes into his life with whom he feels a genuine human connection. She helps break his sad routine and rediscover joy. One is in Tokyo, one is in LA. In one the woman is a much younger stranger, in the other his daughter.</p>

<p>But overarching story aside, even specific scenes are strongly derivative:</p>

<ul>
<li>There&rsquo;s an absurdist foreign interview of the lead in each.</li>
<li>Both heavily feature scenes of characters lying in beds.</li>
<li>Each has meta-scenes of leads watching TV.</li>
<li>Both include a scene of another woman sleeping over with the lead, and the awkward morning after interaction with the young woman about it.</li>
<li>There are scenes of the characters swimming around in upscale hotel pools.</li>
<li>Each has a scene of the lead watching strippers.</li>
</ul>

<p>I understand having a few callbacks in there to the filmmaker&rsquo;s previous work, but this is something else.</p>

<p><em>Lost in Translation</em> is clearly the distantly better movie. My takeaway is that although it had a good script, Bill Murray and the overwhelming chemistry between him and Scarlett Johansson carried that movie. Switch out those two leads, and it&rsquo;s very possible that like <em>Somewhere</em>, almost no one would have heard of it.</p>
]]></content>
    <published>2026-04-06T19:09:44-05:00</published>
    <updated>2026-04-06T19:09:44-05:00</updated>
    <link href="https://brandur.org/fragments/somewhere"></link>
    <id>tag:brandur.org,2026-04-06:fragments/somewhere</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>The special hell of Bolt, Europe&#39;s Uber clone</title>
    <summary>I don&amp;rsquo;t like this app.</summary>
    <content type="html"><![CDATA[<p>I was in Latvia a few weeks ago. Riga&rsquo;s one of the European cities without a good transit link from the airport into the city. Snooping around online, I found that the recommended way to get a ride was to use an app called Bolt, a European clone of Uber. I realize now that I didn&rsquo;t actually check that Uber wasn&rsquo;t available in Latvia, but I&rsquo;m not against experimenting with a new app here and there.</p>

<p>I used it twice to get to and from the city center, and it worked perfectly. Neither of my drivers spoke English and I didn&rsquo;t speak a word of Latvian, but that&rsquo;s what technology&rsquo;s for. The rides went off without a hitch and I got exactly where I was supposed to be both times.</p>

<p>I arrived in Lyon recently and figured, hey, this is Europe, why not try the European app again, and used Bolt.</p>

<h2 id="ride-attempt-1" class="link"><a href="#ride-attempt-1">Ride attempt no. 1</a></h2>

<p>Car pulls into airport, drives to the waiting spot, stops up ahead of me, I walk over to it, driver pulls away, and leaves the airport. Mystified, I photograph the guy&rsquo;s license plate as he drives off figuring I might need it for dispute evidence.</p>

<p>The driver doesn&rsquo;t cancel the ride as he rides off into the distance, leaving me to do it, presumably so it falls to me to pay the app&rsquo;s €7 cancellation fee.</p>

<h2 id="ride-attempt-2" class="link"><a href="#ride-attempt-2">Ride attempt no. 2</a></h2>

<p>I cancel and try again. I get a ride parked not far off, but with a message: &ldquo;This is an automated acceptance. This car is set to charge for another 45 minutes.&rdquo; Sure enough, it&rsquo;s unmoving and unresponsive, and eventually the ride times out (thankfully, avoiding another €7 charge).</p>

<h2 id="ride-attempt-3" class="link"><a href="#ride-attempt-3">Ride attempt no. 3</a></h2>

<p>No message this time, but another car that appears to be charging and/or long term parked (it&rsquo;s a Tesla, so I suspect charging again). I leave the app, waiting for the pick up to time out.</p>

<h2 id="ride-attempt-4" class="link"><a href="#ride-attempt-4">Ride attempt no. 4</a></h2>

<p>I give up on Bolt, and switch to Uber. I match a driver right away. It&rsquo;s almost suspicious how quickly I matched him. But this is good! Progress. He drives over and I walk up to meet him. I get in the car and we start moving. Finally, this fiasco is over.</p>

<p>But then a guy runs up to the driver&rsquo;s window. Hey, he shouts, you&rsquo;re our ride! We booked you on Bolt. We just talked on the phone a few minutes ago, remember?</p>

<p>Knowing that his license plate and photo match what&rsquo;s on their screen, the driver doesn&rsquo;t bother denying it, and instead just points to his phone&rsquo;s screen and says, I pick up Brandur. See?</p>

<p>Even as the car&rsquo;s &ldquo;winner&rdquo; (I&rsquo;m not sure if this was because I got to the car first or the Uber fare was more favorable for the driver), I have principles, and of course don&rsquo;t love this situation either, but my only alternative would be to get out and cancel the ride, for which I&rsquo;d surely get hit with another fee. Unfortunately my best option is to stay quiet about it, let the Bolt user get another ride, and give the driver a low rating later. Naturally, the driver didn&rsquo;t cancel the other guy&rsquo;s Bolt ride (at least as far as I observed from the back seat), which would&rsquo;ve left the user to eat the €7 fee.</p>

<p>As we drove away from the airport, I suddenly realized: wait! this must be what happened to <em>me</em> during my first ride.</p>

<h2 id="ride-attempt-1-part-2" class="link"><a href="#ride-attempt-1-part-2">Ride attempt no. 1, part 2</a></h2>

<p>I go back into the Bolt app and open a support conversation. This option is purposely hidden deep inside submenus of submenus of submenus, so it took me five minutes to find it. I explain what happened and include the photographic evidence. From the first response it&rsquo;s obvious they have me talking to an AI. I drop all formality, and type only the minimum viable number of characters to get the next response. The AI promises me a refund for my €7 cancellation fee, then proceeds to provide no refund.</p>

<p>Eventually I&rsquo;m escalated to a human operator, who somehow manages to be worse than the AI. After explaining the situation again, I&rsquo;m told that fine, in this extremely rare, never-before-seen, once-in-a-cosmic-era situation, they&rsquo;ll refund the €7 fee. But don&rsquo;t fuck up again!</p>

<p>Don&rsquo;t worry Bolt, I won&rsquo;t. My days of using you scam peddlers are over.</p>

<p>When something works well enough, it&rsquo;s easy to take it for granted. As much flak as Uber and Lyft take, my experience with Bolt made me stop and think that even given 10+ years and hundreds of rides on both apps, my bad experiences have numbered like maybe, two? That sort of quality bar isn&rsquo;t an easy thing to maintain.</p>
]]></content>
    <published>2025-07-15T07:50:08-07:00</published>
    <updated>2025-07-15T07:50:08-07:00</updated>
    <link href="https://brandur.org/fragments/special-hell-of-bolt-app"></link>
    <id>tag:brandur.org,2025-07-15:fragments/special-hell-of-bolt-app</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Occasionally injected clocks in Postgres</title>
    <summary>Using a coalescable parameter to stub time as necessary in tests, but otherwise use the shared database clock across all operations.</summary>
    <content type="html"><![CDATA[<p>In a standard app deployment that&rsquo;s scaled horizontally across many nodes, we can expect the clocks to be a little askew across the fleet. It&rsquo;s generally not a huge problem these days because our <a href="https://en.wikipedia.org/wiki/Network_Time_Protocol">use of NTP</a> is so good and so widespread, but minor drift is still present.</p>

<p>Where a single source of time authority is desired, a nice trick is to use the database. A single database is shared across all deployed nodes, so by using the database&rsquo;s <code>now()</code> function instead of <code>time.Now()</code> in code, we can expect perfect consistency across all created records.</p>

<p>But a downside of this approach is that it makes time hard to stub, because unlike a time source in application code, Postgres&rsquo; <code>now()</code> can&rsquo;t easily be overridden. Stubbing time is often a necessity in tests and not being able to do so is a deal breaker.</p>

<p>We&rsquo;ve been using a hybrid approach with some success. A call to <code>coalesce</code> prefers an injected timestamp if there is one, but falls back on <code>now()</code> most of the time (including in production) to share a clock.</p>

<h2 id="sql-sqlc" class="link"><a href="#sql-sqlc">Step 1: SQL + sqlc</a></h2>

<p>Here&rsquo;s a sample query showing the <code>coalesce</code> in action. <code>sqlc.narg</code> defines a parameter as nullable.</p>

<pre><code class="language-sql">-- name: QueuePause :execrows
UPDATE queue
SET paused_at = CASE
                WHEN paused_at IS NULL THEN coalesce(
                    sqlc.narg('now')::timestamptz,
                    now()
                )
                ELSE paused_at
                END
WHERE name = @name;
</code></pre>

<p>In <code>sqlc.yaml</code>, tell sqlc to emit nullable timestamps as <code>*time.Time</code> pointers:</p>

<pre><code class="language-yaml">version: &quot;2&quot;
sql:
  - engine: &quot;postgresql&quot;
    queries: ...
    schema: ...
    gen:
      go:
        overrides:
          - db_type: &quot;timestamptz&quot;
            go_type:
              type: &quot;time.Time&quot;
              pointer: true
            nullable: true
</code></pre>

<p>Which generates this code:</p>

<pre><code class="language-go">const queuePause = `-- name: QueuePause :execrows
UPDATE queue
SET
    paused_at = CASE WHEN paused_at IS NULL THEN coalesce($1::timestamptz, now()) ELSE paused_at END
WHERE name = $2
`

type QueuePauseParams struct {
    Now  *time.Time
    Name string
}

func (q *Queries) QueuePause(ctx context.Context, db DBTX, arg *QueuePauseParams) (int64, error) {
    result, err := db.Exec(ctx, queuePause, arg.Now, arg.Name)
    if err != nil {
        return 0, err
    }
    return result.RowsAffected(), nil
}
</code></pre>

<h2 id="stubbable-time-generator" class="link"><a href="#stubbable-time-generator">Step 2: Stubbable time generator</a></h2>

<p>Working in Go, define a <code>TimeGenerator</code> interface:</p>

<ul>
<li>When unstubbed, it returns the current time from <code>NowUTC()</code> or <code>nil</code> from <code>NowUTCOrNil()</code>.</li>
<li>When stubbed, it returns the stubbed time from <code>NowUTC()</code> or a pointer version of the same from <code>NowUTCOrNil()</code>.</li>
</ul>

<pre><code class="language-go">// TimeGenerator generates a current time in UTC. In test
// environments it's implemented by TimeStub which lets the
// current time be stubbed. Otherwise, it's implemented as
// UnstubbableTimeGenerator which doesn't allow stubbing.
type TimeGenerator interface {
    // NowUTC returns the current time. This may be a stubbed
    // time if the time has been actively stubbed in a test.
    NowUTC() time.Time

    // NowUTCOrNil returns the currently stubbed time _if_ the
    // current time is stubbed, and returns nil otherwise.
    // This is generally useful in cases where a component may
    // want to use a stubbed time if the time is stubbed, but
    // to fall back to a database time default otherwise.
    NowUTCOrNil() *time.Time
}
</code></pre>

<p>A stubbable implementation for tests:</p>

<pre><code class="language-go">type TimeStub struct {
    nowUTC *time.Time
}

func (t *TimeStub) NowUTC() time.Time {
    if t.nowUTC == nil {
        return time.Now().UTC()
    }

    return *t.nowUTC
}

func (t *TimeStub) NowUTCOrNil() *time.Time {
    return t.nowUTC
}

func (t *TimeStub) StubNowUTC(nowUTC time.Time) time.Time {
    t.nowUTC = &amp;nowUTC
    return nowUTC
}
</code></pre>

<p>An unstubbable time generator for production:</p>

<pre><code class="language-go">type UnstubbableTimeGenerator struct{}

func (g *UnstubbableTimeGenerator) NowUTC() time.Time       { return time.Now().UTC() }
func (g *UnstubbableTimeGenerator) NowUTCOrNil() *time.Time { return nil }

func (g *UnstubbableTimeGenerator) StubNowUTC(nowUTC time.Time) time.Time {
    panic(&quot;time not stubbable outside tests&quot;)
}
</code></pre>

<h3 id="shared-time-generator" class="link"><a href="#shared-time-generator">Step 3: Distributing a shared time generator</a></h3>

<p>The next key aspect is that all code needs to share a single instance of <code>TimeGenerator</code> so that when it&rsquo;s stubbed from a test, all services and subservices get the same stubbed value.</p>

<p>We put a <code>TimeGenerator</code> on a base service archetype that&rsquo;s automatically injected from top-level services to subservices:</p>

<pre><code class="language-go">func (c *Client[TTx]) QueuePauseTx(ctx context.Context, tx TTx, name string, opts *QueuePauseOpts) error {
    executorTx := c.driver.UnwrapExecutor(tx)

    if err := executorTx.QueuePause(ctx, &amp;QueuePauseParams{
        Name: name,
        Now:  c.baseService.Time.NowUTCOrNil(), // &lt;-- accessed here
    }); err != nil {
        return err
    }

    return nil
}
</code></pre>

<p>By default, it&rsquo;s instantiated as <code>UnstubbableTimeGenerator</code>. From tests, it&rsquo;s a <code>TimeStub</code>:</p>

<pre><code class="language-go">func BaseServiceArchetype(tb testing.TB) *baseservice.Archetype {
    tb.Helper()

    return &amp;baseservice.Archetype{
        Logger: Logger(tb),
        Time:   &amp;TimeStub{},
    }
}
</code></pre>

<p>In a test, time is stubbed like:</p>

<pre><code class="language-go">stubbedNow := client.baseService.Time.StubNowUTC(time.Now().UTC())
</code></pre>

<h2 id="loose-conviction" class="link"><a href="#loose-conviction">Loose conviction</a></h2>

<p>Consider this one a loose recommendation. It&rsquo;s useful in some situations where timestamp consistency is critically important, but not in others where it isn&rsquo;t. Server clocks tend to be pretty good nowadays, and it&rsquo;s a lot of code to avoid a few tens of microseconds worth of drift.</p>

<p>Also, consider that there might be a downside to using the database clock. In SQL, <code>CURRENT_TIMESTAMP</code> and <code>now()</code> in Postgres represent the current time <em>at the start of the current transaction</em> rather than the current time. This might be a benefit as all records created during a transaction are assigned the same created time, but it&rsquo;s just as often undesirable because depending on the duration of the transaction, timestamps can be wildly unrepresentative of when things actually happened.</p>
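<p>The difference is easy to see in a psql session. Postgres also provides <code>clock_timestamp()</code>, which does advance during a transaction, making a useful contrast (a sketch; exact values will vary):</p>

<pre><code class="language-sql">BEGIN;
SELECT now();             -- transaction start time
SELECT pg_sleep(2);
SELECT now();             -- same value as the first select
SELECT clock_timestamp(); -- roughly two seconds later
COMMIT;
</code></pre>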
]]></content>
    <published>2025-06-29T10:48:16-07:00</published>
    <updated>2025-06-29T10:48:16-07:00</updated>
    <link href="https://brandur.org/fragments/postgres-clocks"></link>
    <id>tag:brandur.org,2025-06-29:fragments/postgres-clocks</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Testing the graceful handling of request cancellation in Go, 499s</title>
    <summary>Using built-in &lt;code&gt;net/http&lt;/code&gt; facilities to make sure that canceled requests are abandoned immediately to save time and resources.</summary>
    <content type="html"><![CDATA[<p>We had a situation a few days ago where a lazy loading problem in our Ruby code led to long running requests that our Dashboard, with an optimistic five second deadline on backend requests, was timing out. This raised a question in Slack: if our frontend does time out a backend request, does the request keep running? Or does the API know how to save resources by abandoning it midway through?</p>

<p>If the API stack&rsquo;s being bombarded by expensive requests that are largely being canceled early, it&rsquo;s a huge optimization to make sure that they only use the resources that they absolutely need to. Requests discarded early stop executing immediately and no further effort is put toward servicing them.</p>

<p>In most code I&rsquo;ve ever worked in, I could quite confidently answer the question above with a definitive and resounding &ldquo;no&rdquo;. Doing a good job of request cancellation requires it be baked quite deeply into the language and low-level libraries, which isn&rsquo;t common. And even when those handle it well, userland code usually doesn&rsquo;t. Also, canceling a request midway in services that don&rsquo;t use transactions would be unacceptably dangerous &ndash; <a href="/acid#atomicity">mutated state would be left mutated</a>, and that&rsquo;d cause untold trouble later on.</p>

<h2 id="go-cancellation" class="link"><a href="#go-cancellation">Cancellation in Go</a></h2>

<p>But in a Go stack, the built-in HTTP server <a href="https://pkg.go.dev/net/http#Request.Context">should handle cancellations using context</a>:</p>

<blockquote>
<p>For incoming server requests, the context is canceled when the client&rsquo;s connection closes, the request is canceled (with HTTP/2), or when the ServeHTTP method returns.</p>
</blockquote>

<p>And with our code being widely safeguarded by transactions, the feature should even be safe to use!</p>

<h2 id="prove-it" class="link"><a href="#prove-it">Now prove it</a></h2>

<p>Theory is one thing, but reality is another. If request cancellations indeed work, we should be able to prove it, so I set up a little bootstrap in pursuit of that. To make testing easy, add an artificial API endpoint that waits on either a sleep timer or the context finishing, whichever comes first:</p>

<pre><code class="language-go">select {
case &lt;-time.After(5 * time.Second):
case &lt;-ctx.Done():
    return nil, ctx.Err()
}
</code></pre>

<p>Start the API server. Then from another terminal, run cURL and interrupt it after a few seconds:</p>

<pre><code class="language-sh">$ curl -i http://localhost:5222/sleep
^C
</code></pre>
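<p>For a self-contained version of the same experiment, here&rsquo;s a sketch that assumes a bare <code>net/http</code> handler rather than our API framework, driving both sides from one program:</p>

<pre><code class="language-go">package main

import (
    &quot;context&quot;
    &quot;fmt&quot;
    &quot;net/http&quot;
    &quot;net/http/httptest&quot;
    &quot;time&quot;
)

// sleepHandler waits five seconds, but stops immediately if the
// client goes away and the request context is canceled.
func sleepHandler(w http.ResponseWriter, r *http.Request) {
    select {
    case &lt;-time.After(5 * time.Second):
        fmt.Fprintln(w, &quot;slept&quot;)
    case &lt;-r.Context().Done():
        // Client disconnected; abandon the work.
    }
}

func main() {
    srv := httptest.NewServer(http.HandlerFunc(sleepHandler))
    defer srv.Close()

    // Stand in for the user's ^C: a client that gives up after
    // 100 milliseconds.
    ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL+&quot;/sleep&quot;, nil)
    if err != nil {
        panic(err)
    }

    start := time.Now()
    _, err = http.DefaultClient.Do(req)
    fmt.Printf(&quot;request gave up after %s: %v\n&quot;, time.Since(start).Round(time.Millisecond), err)
}
</code></pre>

<p>The program exits after about 100 milliseconds rather than five seconds, showing the handler really did abandon the sleep.</p>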

<p>I found that we were handling canceled requests reasonably well, but that the error we were logging wasn&rsquo;t right. The code was checking context cancellation, but getting confused between a context canceled by the HTTP server and one canceled by our built-in timeout middleware, improperly sending a <code>408 Request timeout</code> to logs.</p>

<h2 id="local-vs-request" class="link"><a href="#local-vs-request">Local vs. request context</a></h2>

<p>After a little refactoring, I ended up with this code:</p>

<pre><code class="language-go">func (e *APIEndpoint[TReq, TResp]) Execute(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()

    // Add a default timeout for all API requests to ensure there's
    // always a backstop in case of degenerate behavior. Rescued
    // below and turned into a more user-friendly error.
    ctx, cancel := context.WithTimeout(ctx, RequestTimeout)
    defer cancel()
    
    ...
    
    ret, err := e.serviceHandler(ctx, req)
    if err != nil {
        if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
            // Distinct error message when the request itself was
            // canceled above the API stack versus we had a
            // cancellation/timeout occur within the API endpoint.
            if r.Context().Err() != nil {
                // This is a non-standard status code (499), but
                // fairly widespread because Nginx defined it.
                err = apierror.NewClientClosedRequestError(ctx, errMessageRequestCanceled).WithSpecifics(ctx, err)
            } else {
                err = apierror.NewRequestTimeoutError(ctx, errMessageRequestTimeout).WithSpecifics(ctx, err)
            }
        }

        WriteError(ctx, w, err)
        return
    }
</code></pre>

<p>Should a context error occur, we return a <code>408 Request timeout</code> in case of a timeout on local <code>ctx</code>, but a <code>499 Client closed request</code> if context was canceled upstream by the HTTP server canceling <code>r.Context()</code>.</p>

<p><code>499</code> isn&rsquo;t a real status code, but rather one invented by Nginx which happens to be useful here. It doesn&rsquo;t really matter what status code we use because the end user (who canceled the request before the status code returned) will never see it. It&rsquo;s purely for our own logging and telemetry.</p>

<p>Looking at local logs running the sleep/cancel routine, I now see this:</p>

<pre><code class="language-txt">canonical_api_line GET /sleep -&gt; 499 (4.162702459s)
    api_error_cause=&quot;context canceled&quot;
    api_error_internal_code=client_closed_request
    api_error_message=&quot;Context of incoming request canceled; API endpoint stopped executing.&quot;
</code></pre>

<h3 id="generalizing-cancellation" class="link"><a href="#generalizing-cancellation">Generalizing cancellation handling</a></h3>

<p>Although our demo uses an artificial sleep statement, importantly this still works for any normal request. Our code isn&rsquo;t littered with <code>&lt;-ctx.Done()</code> checks all over the place, but it does have a great many database operations like this one:</p>

<pre><code class="language-go">account, err := dbsqlc.New().AccountTouchLastSeenAt(ctx, e, apiKey.AccountID)
if err != nil {
    return nil, xerrors.Errorf(&quot;error looking up account: %w&quot;, err)
}
</code></pre>

<p>These call into Sqlc, which calls into Pgx, and Pgx detects a canceled context and sends back an error. In the event of a canceled request, the first database operation will come back with an error that&rsquo;ll bubble back up the stack to our API endpoint infrastructure. There it&rsquo;ll be turned into a <code>499</code>. Subsequent database operations won&rsquo;t run, saving time and resources.</p>

<pre><code class="language-go">// API service handler error handling. Repeated from above.
ret, err := e.serviceHandler(ctx, req)
if err != nil {
    if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
        // Distinct error message when the request itself was
        // canceled above the API stack versus we had a
        // cancellation/timeout occur within the API endpoint.
        if r.Context().Err() != nil {
            // This is a non-standard status code (499), but
            // fairly widespread because Nginx defined it.
            err = apierror.NewClientClosedRequestError(ctx, errMessageRequestCanceled).WithSpecifics(ctx, err)
        } else {
            err = apierror.NewRequestTimeoutError(ctx, errMessageRequestTimeout).WithSpecifics(ctx, err)
        }
    }

    WriteError(ctx, w, err)
    return
}
</code></pre>

<p>Pgx is one example of a library that&rsquo;ll check context cancellation, but the same check generally occurs in any low-level library that&rsquo;s doing I/O. As another example, SDKs like AWS&rsquo;s or Stripe&rsquo;s usually go through <code>net/http</code>, which will catch it.</p>
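<p>The reason all of this composes is that the request context gets threaded into every outbound call. A minimal sketch (the <code>fetch</code> helper and URL here are illustrative, not our real code):</p>

<pre><code class="language-go">package main

import (
    &quot;context&quot;
    &quot;fmt&quot;
    &quot;io&quot;
    &quot;net/http&quot;
)

// fetch threads the caller's context into an outbound HTTP
// request, so canceling the inbound request cancels this one too.
func fetch(ctx context.Context, url string) (string, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return &quot;&quot;, err
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return &quot;&quot;, err // fails fast if ctx was already canceled
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    return string(body), err
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // simulate the inbound request having been canceled

    // The error wraps context.Canceled; no time is wasted on the
    // network round trip.
    _, err := fetch(ctx, &quot;http://localhost:5222/sleep&quot;)
    fmt.Println(err)
}
</code></pre>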

<p>With code exercised (and adequate new testing in place), I was confident returning to Slack and declaring that &ldquo;yes&rdquo;, request cancellation is handled smoothly. I can&rsquo;t say the same about our Ruby code, but that&rsquo;s an adventure for another day.</p>
]]></content>
    <published>2025-06-20T00:16:09+02:00</published>
    <updated>2025-06-20T00:16:09+02:00</updated>
    <link href="https://brandur.org/fragments/testing-request-cancellation"></link>
    <id>tag:brandur.org,2025-06-20:fragments/testing-request-cancellation</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Be careful with Dropbox</title>
    <summary>Fun times with Dropbox&amp;rsquo;s new &lt;code&gt;~/Library/CloudStorage&lt;/code&gt; location and File Provider API integration.</summary>
    <content type="html"><![CDATA[<p>I&rsquo;ve been a Dropbox users going on fifteen years now. It&rsquo;s one of the most frustrating products in my arsenal because fifteen years ago it was <em>perfect</em>, but every new release just makes it a little bit worse than it was before. It&rsquo;s still fine to use, but you can see the writing on the wall as the long term trend is all in the wrong direction.</p>

<p>Despite that, I previously would&rsquo;ve lavished it with praise in that I&rsquo;ve never once had trouble with data loss or data integrity. Despite increasing feature bloat, it did what it was supposed to, syncing files to the right places, and doing so <em>reliably</em>, which is pretty much all I need out of it.</p>

<p>That ended Friday, when I was installing Dropbox on a new laptop. My Dropbox size runs ~500 GB, so when credentialing a new machine, I copy it from another computer on the network for speed, and to conserve precious bandwidth <sup id="footnote-1-source"><a href="#footnote-1">1</a></sup> :</p>

<ul>
<li>Rsync <code>~/Dropbox</code> from an existing computer to the new one.</li>
<li><code>brew install dropbox</code>. Open it, log in, close it.</li>
<li>Replace the contents of <code>~/Dropbox</code> with the <code>rsync</code>ed copy.</li>
<li>Open Dropbox, let it sync against the new data. It should find everything it needs already there.</li>
</ul>

<p>Dropbox made a change in the last couple years wherein they moved the standard <code>~/Dropbox</code> on Mac to a new <code>~/Library/CloudStorage/Dropbox</code> location. I now know that folders in this directory are meant for use with Apple&rsquo;s <a href="https://developer.apple.com/documentation/fileprovider/">File Provider API</a>.</p>

<p>Apparently the change had been introduced for macOS Ventura (two major versions ago), but there must&rsquo;ve been an incremental roll out because I set up a computer last year and didn&rsquo;t run into it then. Once you&rsquo;ve been opted into the feature, you cannot opt out. Changing back to <code>~/Dropbox</code> is not an option.</p>

<p>Normally I do a wholesale swap of <code>~/Dropbox</code> with my locally copied version, but seeing this new magic folder in <code>~/Library</code>, I worried there&rsquo;d be some irreversible effect if I did it the normal way. Instead, I closed Dropbox, <code>cd</code>ed into the folder to <code>rm</code> all the files acting as cloud &ldquo;stubs&rdquo;, intending to replace them with materialized versions from my local copy.</p>

<p>What a mistake. I dumbly assumed that with Dropbox closed, any changes I made to the folder would be safe, just like they were in every previous version of Dropbox. Not so. At all.</p>

<p>I got suspicious after about ten seconds. Normally an <code>rm</code> even on gigantic directories is near instant, but this one was running long. I <code>SIGINT</code>ed it, but the damage was done.</p>

<p>I&rsquo;m sure you guessed what happened already. <code>~/Library/CloudStorage</code> is a magic location, and folders in it use macOS extension voodoo to make arbitrary changes in a cloud storage API. Despite Dropbox not being open, it&rsquo;d used a Mac API to intercept the <code>rm</code> and started to remove everything. My other computers had already synced the deletions. 100s of GBs gone in seconds.</p>

<p>Dropbox has a good &ldquo;undelete&rdquo; function, so I was able to log into their web UI and recover all the deleted files, but I was left with the problem of all my other computers having purged their local contents, with potentially 100s of GBs on each needing to be synced back down (and I thought I was <em>saving</em> bandwidth when I started doing this). Worse yet, Dropbox puts any files it deletes into a <code>~/Dropbox/.dropbox.cache</code> directory, but can&rsquo;t reuse any of that data when files are recovered, so it just makes a copy. Dropbox doesn&rsquo;t purge its cache often, even if disk space gets critically low, so every computer potentially needed 2 * 500 GB =~ 1 TB of free space for the full recovery, which they didn&rsquo;t have.</p>

<p>Two days later, I got everything back to where it should be, but all I could think afterwards was what a stupid, unforced error this all was. A mandatory move to <code>~/Library/CloudStorage/Dropbox</code>/File Provider API has no marginal utility for the user <sup id="footnote-2-source"><a href="#footnote-2">2</a></sup>, even if it makes product managers at the company feel good about themselves.</p>

<p>Being particularly incensed at this moment, I started looking into alternatives immediately. There are dozens out there, but my approximate evaluation is that there isn&rsquo;t one that&rsquo;s a crystal clear, unambiguous win that I&rsquo;d be excited about doing the work to switch over to, which is too bad.</p>

<p>What I&rsquo;m really looking for is Dropbox circa 2011. The one without the gratuitous/dangerous product changes, without an Electron app, and without the nags to upgrade my account which I already pay $120/year for.</p>

<p>Anyway, I doubt most users will run into this one as it was a confluence of stupid things that led me down this path, but I&rsquo;d just caution like the title says: be careful with Dropbox. Don&rsquo;t <code>rm</code> too much. Don&rsquo;t assume intuitive cause and effect. Don&rsquo;t assume operations are safe even if the app is closed.</p>


]]></content>
    <published>2025-05-26T00:28:04-06:00</published>
    <updated>2025-05-26T00:28:04-06:00</updated>
    <link href="https://brandur.org/fragments/careful-with-dropbox"></link>
    <id>tag:brandur.org,2025-05-26:fragments/careful-with-dropbox</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Optimizing JPEGs with MozJPEG for local archival</title>
    <summary>Writing a wrapper script around MozJPEG to achieve ~80% compression on large JPEGs with little downside.</summary>
    <content type="html"><![CDATA[<p>Call me old fashioned, but I like to keep my photo collection as local files on disk rather than symbolic pointers in the cloud, or sent off to deep storage on large archival drives, neither of which I&rsquo;m likely to ever look at again. It&rsquo;s nice having quick access to them that still works over a bad internet link or on an airplane.</p>

<p>It&rsquo;s a great system, but it&rsquo;s been getting more difficult as time goes by. My photo collection grows year by year, but Apple&rsquo;s hard drive sizes stay frozen circa 2012. I&rsquo;m running the same 1 TB drive that I was five years ago, which is only incrementally larger than five years before that (and even the miserly 1 TB is still a $200 upcharge over the default <em>512 GB</em> that&rsquo;s somehow a thing that Apple sells in 2025).</p>

<p>Realistically, I know that I&rsquo;ll never look at the majority of these photos again, so I already prune the collections aggressively to keep only the highlights, but was looking for storage opportunities beyond that. Years ago I wrote about <a href="/fragments/libjpeg-mozjpeg">optimizing JPEGs for this site using MozJPEG</a>, and knowing that a lot of cameras produce suboptimally compressed JPEGs, realized there was a similar opportunity for archival.</p>

<p>I ended up writing a wrapper around MozJPEG that saves about 80% of space compared to what comes out of my camera. Here&rsquo;s a sample run:</p>

<pre><code>
    $ optimize 001-ana-nuevo/*
    created: 001-ana-nuevo/2W4A6210.jpg (9.02MB -&gt; 2.11MB / saved 77%)
    created: 001-ana-nuevo/2W4A6212.jpg (8.21MB -&gt; 1.79MB / saved 78%)
    created: 001-ana-nuevo/2W4A6216.jpg (11.0MB -&gt; 2.68MB / saved 76%)
    created: 001-ana-nuevo/2W4A6218.jpg (6.36MB -&gt; 1.29MB / saved 80%)
    created: 001-ana-nuevo/2W4A6219.jpg (12.11MB -&gt; 3.01MB / saved 75%)
    created: 001-ana-nuevo/2W4A6224.jpg (7.3MB -&gt; 1.69MB / saved 77%)
    created: 001-ana-nuevo/2W4A6228.jpg (7.75MB -&gt; 1.72MB / saved 78%)
    created: 001-ana-nuevo/2W4A6230.jpg (8.62MB -&gt; 1.99MB / saved 77%)
    created: 001-ana-nuevo/2W4A6236.jpg (8.14MB -&gt; 1.87MB / saved 77%)
    created: 001-ana-nuevo/2W4A6237.jpg (6.65MB -&gt; 1.48MB / saved 78%)
    created: 001-ana-nuevo/2W4A6238.jpg (7.59MB -&gt; 1.69MB / saved 78%)
    created: 001-ana-nuevo/2W4A6240.jpg (9.38MB -&gt; 2.21MB / saved 76%)
    created: 001-ana-nuevo/2W4A6242.jpg (9.26MB -&gt; 2.22MB / saved 76%)
    created: 001-ana-nuevo/2W4A6243.jpg (10.17MB -&gt; 2.44MB / saved 76%)
    created: 001-ana-nuevo/2W4A6247.jpg (10.49MB -&gt; 2.56MB / saved 76%)
    created: 001-ana-nuevo/2W4A6251.jpg (7.92MB -&gt; 1.84MB / saved 77%)
    created: 001-ana-nuevo/2W4A6252.jpg (8.97MB -&gt; 2.12MB / saved 76%)
    created: 001-ana-nuevo/2W4A6253.jpg (7.74MB -&gt; 1.75MB / saved 77%)
    created: 001-ana-nuevo/2W4A6254.jpg (9.43MB -&gt; 2.3MB / saved 76%)
    created: 001-ana-nuevo/2W4A6255.jpg (10.78MB -&gt; 2.65MB / saved 75%)
    created: 001-ana-nuevo/2W4A6258-pups.jpg (9.13MB -&gt; 2.22MB / saved 76%)
    created: 001-ana-nuevo/2W4A6259.jpg (10.46MB -&gt; 2.55MB / saved 76%)
    created: 001-ana-nuevo/2W4A6260.jpg (8.54MB -&gt; 2.04MB / saved 76%)
    created: 001-ana-nuevo/2W4A6262.jpg (10.3MB -&gt; 2.59MB / saved 75%)
    created: 001-ana-nuevo/2W4A6266.jpg (8.81MB -&gt; 2.19MB / saved 75%)
    created: 001-ana-nuevo/2W4A6267.jpg (9.64MB -&gt; 2.31MB / saved 76%)
    created: 001-ana-nuevo/2W4A6268.jpg (9.83MB -&gt; 2.33MB / saved 76%)
    created: 001-ana-nuevo/2W4A6269.jpg (8.93MB -&gt; 2.14MB / saved 76%)
    created: 001-ana-nuevo/2W4A6271.jpg (7.38MB -&gt; 1.74MB / saved 76%)
    created: 001-ana-nuevo/2W4A6272.jpg (7.19MB -&gt; 1.68MB / saved 77%)
    created: 001-ana-nuevo/2W4A6283-water-fight.jpg (7.65MB -&gt; 1.73MB / saved 77%)
    created: 001-ana-nuevo/2W4A6284.jpg (8.02MB -&gt; 1.77MB / saved 78%)
    created: 001-ana-nuevo/2W4A6286.jpg (5.82MB -&gt; 1.11MB / saved 81%)
    created: 001-ana-nuevo/2W4A6287.jpg (6.03MB -&gt; 1.14MB / saved 81%)
</code></pre>

<p>I&rsquo;m sure there&rsquo;s some subtle downside to the extra compression, but I&rsquo;ve tried zooming all the way in on a couple samples before and after, and I can see differences right at the pixel level, but the optimized version isn&rsquo;t clearly worse to my eye.</p>

<p>My script is use-at-your-own-risk me-ware that I&rsquo;m not publishing in any official sense, but <a href="https://gist.github.com/brandur/8a7a7c7870fce52bcf1ac0c34d66af30">here it is for reference</a>.</p>

<p>Some gotchas I ran into and which might save someone else time/trouble:</p>

<ul>
<li><p>The MozJPEG binary to compress JPEGs is called <code>cjpeg</code>. This is an old Linux style project, and naming the binary after the project would make things too easy and too obvious for users. Under the strict edicts of 1970s Unix philosophy, that&rsquo;s completely unacceptable.</p></li>

<li><p>You might have multiple packages on your system providing <code>cjpeg</code>. Make sure you&rsquo;re using MozJPEG&rsquo;s because it offers much better compression than libjpeg or libjpeg-turbo. You can see here that my default <code>cjpeg</code> is <em>not</em> MozJPEG&rsquo;s:</p></li>
</ul>

<pre><code class="language-sh">$ which cjpeg
/opt/homebrew/bin/cjpeg

$ ls -l /opt/homebrew/bin/cjpeg
lrwxr-xr-x@ 1 brandur  admin    36B Feb 10 11:45 /opt/homebrew/bin/cjpeg -&gt; ../Cellar/jpeg-turbo/3.1.0/bin/cjpeg
</code></pre>

<ul>
<li><p>The original libjpeg <code>cjpeg</code> didn&rsquo;t support <em>reading</em> JPEGs, only writing them, and would encourage you to read JPEGs with another binary called <code>djpeg</code> and pipe that into <code>cjpeg</code> (again, the wonders of Unix philosophy). You can do that with MozJPEG too, but DO NOT DO THAT! Piping will strip EXIF data, which <a href="/fragments/stop-stripping-exif">you shouldn&rsquo;t do</a>. Unlike libjpeg&rsquo;s version, MozJPEG&rsquo;s <code>cjpeg</code> does read JPEGs, so piping is not necessary.</p></li>

<li><p>If you&rsquo;re writing to a new file and then replacing the original after (which you probably should for safety), make sure to copy the original create/modify timestamps to the new file. The easiest way to do this is with <code>touch -r &lt;original&gt; &lt;new&gt;</code>.</p></li>
</ul>

<pre><code class="language-txt">TOUCH(1)						      General Commands Manual							  TOUCH(1)

NAME
     touch – change file access and modification times

SYNOPSIS
     touch [-A [-][[hh]mm]SS] [-achm] [-r file] [-t [[CC]YY]MMDDhhmm[.SS]] [-d YYYY-MM-DDThh:mm:SS[.frac][tz]] file ...

DESCRIPTION
     The touch utility sets the modification and access times of
     files. If any file does not exist, it is created with default
     permissions.

     By default, touch changes both modification and access times.
     The -a and -m flags may be used to select the access time or
     the modification time individually.  Selecting both is
     equivalent to the default.  By default, the timestamps are set
     to the current time. The -d and -t flags explicitly specify a
     different time, and the -r flag specifies to set the times
     those of the specified file.  The -A flag adjusts the values
     by a specified amount.

     The following options are available:

     ...

     -r      Use the access and modifications times from the
             specified file instead of the current time of day.
</code></pre>

<p>Another approach would be to do away with JPEG completely and go to HEIC or WebP, but I&rsquo;m still finding support for those a little spotty, and navigating them in a file browser feels slow because the heavier compression takes longer to decode and render. I&rsquo;ll check in on that again in a year or two.</p>
]]></content>
    <published>2025-03-29T12:35:10-07:00</published>
    <updated>2025-03-29T12:35:10-07:00</updated>
    <link href="https://brandur.org/fragments/optimizing-jpegs-for-archival"></link>
    <id>tag:brandur.org,2025-03-29:fragments/optimizing-jpegs-for-archival</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>The right way to do data fixtures in Go</title>
    <summary>A safe, succinct test data fixtures pattern using sqlc and validator.</summary>
    <content type="html"><![CDATA[<p>Every test suite should start early in building a strong convention to generate data fixtures. If it doesn&rsquo;t, data fixtures will still emerge (they&rsquo;re that necessary), but in a way that&rsquo;s poorly designed, with no API (or a poorly designed one), and not standardized.</p>

<p>Other languages tend to have common libraries for fixture generation. As it often does, Go goes its own way and doesn&rsquo;t have a ubiquitous fixtures package, but especially when combining sqlc and <a href="https://github.com/go-playground/validator">validator</a>, it does well without one.</p>

<p>Here&rsquo;s one of our project&rsquo;s 130 fixtures:</p>

<pre><code class="language-go">package dbfactory

type MultiFactorOpts struct {
    ID          *uuid.UUID              `validate:&quot;-&quot;`
    AccountID   uuid.UUID               `validate:&quot;required&quot;`
    ActivatedAt *time.Time              `validate:&quot;-&quot;`
    ExpiresAt   *time.Time              `validate:&quot;-&quot;`
    Kind        *dbsqlc.MultiFactorKind `validate:&quot;-&quot;`
}

func MultiFactor(ctx context.Context, t *testing.T, e db.Executor, opts *MultiFactorOpts) *dbsqlc.MultiFactor {
    t.Helper()

    validateOpts(t, opts)

    var (
        num          = nextNumSeq()
        numFormatted = formatNumSeq(num)
    )

    multiFactor, err := dbsqlc.New().MultiFactorInsert(ctx, e, dbsqlc.MultiFactorInsertParams{
        ID:          ptrutil.ValOrDefaultFunc(opts.ID, func() uuid.UUID { return ptesting.ULID(ctx).New() }),
        AccountID:   opts.AccountID,
        ActivatedAt: ptrutil.TimeSQLNull(opts.ActivatedAt),
        ExpiresAt:   ptrutil.TimeSQLNull(opts.ExpiresAt),
        Kind:        string(ptrutil.ValOrDefault(opts.Kind, dbsqlc.MultiFactorKindTOTP)),
        Name:        fmt.Sprintf(&quot;%s no. %s&quot;, ptrutil.ValOrDefault(opts.Kind, dbsqlc.MultiFactorKindTOTP), numFormatted),
    })
    require.NoError(t, err)

    return multiFactor
}
</code></pre>

<p>The minimum viable use of the fixture needs only <code>AccountID</code>:</p>

<pre><code class="language-go">mf := dbfactory.MultiFactor(ctx, t, tx, &amp;dbfactory.MultiFactorOpts{
    AccountID: account.ID,
})
</code></pre>

<p>But all salient properties are settable, so a more elaborate use just involves sending more overrides:</p>

<pre><code class="language-go">expiredMF := dbfactory.MultiFactor(ctx, t, bundle.tx, &amp;dbfactory.MultiFactorOpts{
    AccountID: account.ID,
    ExpiresAt: ptrutil.Ptr(time.Now().Add(-5 * time.Minute)),
    Kind:      ptrutil.Ptr(dbsqlc.MultiFactorKindWebAuthn),
})
</code></pre>

<h2 id="observations" class="link"><a href="#observations">Observations</a></h2>

<p>A few aspects worth calling out:</p>

<ul>
<li><p>Under the principle of not mocking the database, fixtures are real live data records. They&rsquo;re queryable using the full expressiveness of SQL, are valid according to the schema&rsquo;s data types/checks/triggers, and satisfy foreign keys.</p></li>

<li><p>Fixtures never return an error, instead failing their input <code>t</code> so that generating a fixture is a one liner for the caller and doesn&rsquo;t need an <code>if err != nil { ... }</code> check.</p></li>

<li><p>Inputs are annotated with <a href="https://github.com/go-playground/validator">the Go validate framework</a> to demarcate required versus non-required or more complex validations as needed. This is a godsend because it keeps validations short (zero additional lines instead of a minimum of three for an <code>if</code> statement) and fast/easy to write.</p></li>

<li><p>As few properties are made <code>validate:&quot;required&quot;</code> as possible, with non-nullable fields given defaults instead of being marked mandatory for the caller to fill. This makes fixtures easier to use and reduces boilerplate at call sites. e.g. <code>name</code> is a required property on <code>multi_factor</code> above, but the fixture generates a sane default.</p></li>

<li><p>Insert statements are generated with <a href="/sqlc">sqlc</a>.</p></li>
</ul>

<pre><code class="language-sql">-- name: MultiFactorInsert :one
INSERT INTO multi_factor (
    id,
    account_id,
    activated_at,
    expires_at,
    kind,
    name
) VALUES (
    @id,
    @account_id,
    @activated_at,
    @expires_at,
    @kind,
    @name
) RETURNING *;
</code></pre>

<ul>
<li><p>We use a lot of custom pointer helpers like <code>ptrutil.TimeSQLNull</code> (changes a pointer to a <code>sql.NullTime</code>) and <code>ptrutil.ValOrDefault</code>. Each one of these changes a ~4 line local variable declaration and <code>if</code> block into one LOC inlined into the insert. True Go dogmatists won&rsquo;t like this, but it saves dozens of lines per test fixture, and given hundreds of test fixtures, this adds up to thousands of lines saved overall.</p></li>

<li><p>Each test case gets its own lazily marshaled monotonic ULID generated based on <code>t</code>. Separate generators guarantee monotonicity even if some test cases rewind their generators to generate ULIDs at particular times.</p></li>
</ul>

<h2 id="var-blocks" class="link"><a href="#var-blocks">Organizing with var blocks</a></h2>

<p>Typically, fixtures are generated together in a <code>var ( ... )</code> block, keeping tests looking nice and tidy:</p>

<pre><code class="language-go">t.Run(&quot;SetNameSSOJoinSCIMError&quot;, func(t *testing.T) {
    t.Parallel()

    bundle, ctx := setup(t)

    var (
        org  = dbfactory.Organization(ctx, t, bundle.tx, &amp;dbfactory.OrganizationOpts{SCIMEnabled: true})
        team = dbfactory.Team(ctx, t, bundle.tx, &amp;dbfactory.TeamOpts{OrganizationID: &amp;org.ID})
        _    = dbfactory.AccessGroupAccount_Admin(ctx, t, bundle.tx, team.ID, bundle.account.ID)
    )

    _, err := pservicetest.InvokeHandler(bundle.svc.Update, ctx, &amp;TeamUpdateRequest{
        Name:   ptrutil.Ptr(&quot;new name&quot;),
        TeamID: eid.EID(team.ID),
    })
    prequire.APIErrorWithMessage(t, &amp;apierror.BadRequestError{}, fmt.Sprintf(errMessageTeamUpdateSCIM, &quot;name&quot;), err)
})
</code></pre>

<h2 id="standardize-conventions" class="link"><a href="#standardize-conventions">Standardize conventions, even the small ones</a></h2>

<p>We have a few helpers that are used in almost every test fixture. These are so trivial that they almost don&rsquo;t need to be extracted into their own functions, but we&rsquo;ve done so to prevent implementations from drifting and keep code maximally succinct.</p>

<pre><code class="language-go">// Formats a number like &quot;000007&quot;. Typically used in conjunction
// with nextNumSeq to make identifiers prettier and so they align
// better.
func formatNumSeq(num int64) string {
    return fmt.Sprintf(&quot;%06d&quot;, num)
}

var numSeq int64

// Gets a unique number that can be used in names, etc. and which
// is more friendly to look at than a UUID.
func nextNumSeq() int64 {
    return atomic.AddInt64(&amp;numSeq, 1)
}

func validateOpts(t *testing.T, opts any) {
    t.Helper()

    err := validate.Struct(opts)
    require.NoError(t, err)
}
</code></pre>
]]></content>
    <published>2025-03-20T08:56:52-07:00</published>
    <updated>2025-03-20T08:56:52-07:00</updated>
    <link href="https://brandur.org/fragments/go-data-fixtures"></link>
    <id>tag:brandur.org,2025-03-20:fragments/go-data-fixtures</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Profiling production for memory overruns + canonical log stats</title>
    <summary>Using Go&amp;rsquo;s &lt;code&gt;runtime.MemStats&lt;/code&gt; and canonical log lines to isolate huge memory allocations to a specific endpoint.</summary>
    <content type="html"><![CDATA[<p>You&rsquo;re only lucky for so long. After four years of running our Go API in production with no memory trouble whatsoever, last week we started seeing instantaneous bursts of ~1.5 GB suddenly allocated, enough to cause Heroku to kill the dyno for being &ldquo;vastly over quota&rdquo; (our steady state memory use sits around ~50 MB, so we run on 512 MB dynos).</p>

<p>This was, of course, concerning. We were only experiencing a few of these a day, but with no idea what was causing them, and having appeared very suddenly, we had to assume they might get more frequent. Not only is the API suddenly being taken offline at any moment a bad place to be UX-wise, but even with our careful use of transactions, it makes resource leaks between components possible.</p>

<h2 id="alloc-delta" class="link"><a href="#alloc-delta">Alloc delta</a></h2>

<p>To localize the problem, I used Go&rsquo;s <a href="https://pkg.go.dev/runtime#MemStats"><code>runtime.MemStats</code></a> in conjunction with our <a href="/nanoglyphs/025-logs">canonical API lines</a>, making a new <code>total_alloc_delta</code> property available to see how many allocations took place during the period of an API request:</p>

<pre><code class="language-go">func (m *CanonicalAPILineMiddleware) Wrapper(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        var (
            memStats      runtime.MemStats
            memStatsBegin = m.TimeNow()
        )
        runtime.ReadMemStats(&amp;memStats)
        var (
            memStatsBeginDuration = m.TimeNow().Sub(memStatsBegin)

            // TotalAlloc doesn't decrement on heap frees, so it gives
            // us useful info even if the GC runs during the request.
            totalAllocBegin = memStats.TotalAlloc
        )

        // API request served here
        next.ServeHTTP(w, r)
    
        // Middleware continues ...
        memStatsEnd := m.TimeNow()
        // Since we're only interested in one field, reuse the same
        // struct so we don't need to allocate a second one.
        runtime.ReadMemStats(&amp;memStats)
        var (
            memStatsEndDuration = m.TimeNow().Sub(memStatsEnd)
            totalAllocDelta     = memStats.TotalAlloc - totalAllocBegin
        )
        
        logData := &amp;CanonicalAPILineData{
            ID:                   m.ULID.New(),
            HTTPMethod:           r.Method,
            HTTPPath:             r.URL.Path,
            ...
            ReadMemStatsDuration: timeutil.PrettyDuration(memStatsBeginDuration + memStatsEndDuration),
            TotalAllocDelta:      totalAllocDelta,
        }

        plog.Logger(ctx).WithFields(structToFields(logData)).
            Infof(
                &quot;canonical_api_line %s %s -&gt; %v %s(%s)&quot;,
                r.Method,
                routeOrPath,
                logData.Status,
                idempotencyReplayStr,
                duration,
            )
    })
}
</code></pre>

<p><code>MemStats</code> provides a large bucket of properties to pick from, but <code>TotalAlloc</code>&rsquo;s a useful one because it represents bytes allocated to the heap, but unlike similar stats like <code>HeapAlloc</code>, it&rsquo;s monotonically increasing. It&rsquo;s not decremented as objects are freed:</p>

<pre><code class="language-go">// TotalAlloc is cumulative bytes allocated for heap objects.
//
// TotalAlloc increases as heap objects are allocated, but
// unlike Alloc and HeapAlloc, it does not decrease when
// objects are freed.
TotalAlloc uint64
</code></pre>

<p>This is good because it means all API requests end up with the same memory heuristic, making them roughly comparable. Garbage collection may or may not occur during a request; using <code>TotalAlloc</code> makes it irrelevant whether it did.</p>

<p>With that deployed, I can search logs for outliers (<code>:&gt;500000000</code> means greater than 500 MB):</p>

<pre><code class="language-txt">source:platform app:app[web] canonical_api_line (-http_route:/health-checks/{name})
    total_alloc_delta:&gt;500000000
</code></pre>

<p>And voila, we turn up the bad ones. Here, an API request that spiked memory a full 5 GB!</p>

<pre><code class="language-txt">Jan 29 10:18:33 platform app[web] info canonical_api_line
    POST /queries -&gt; 503 (2.53252138s)
total_alloc_delta=5008335944
</code></pre>

<h2 id="parallel-allocations" class="link"><a href="#parallel-allocations">Parallel allocations</a></h2>

<p>The use of <code>TotalAlloc</code> is imperfect because it not only tracks allocations of the current API request, but allocations across the current API request <em>and</em> all parallel requests.</p>

<p>We can see this effect through false positives:</p>

<pre><code class="language-txt">Feb 1 23:07:18 platform app[web] info canonical_api_line
    GET /clusters/{cluster_id}/databases -&gt; 504 (2m57.322010348s)
total_alloc_delta=743772480
</code></pre>

<p>It looks like this API request allocated 744 MB, but what actually happened is that it was a bad timeout that executed for a full three minutes <sup id="footnote-1-source"><a href="#footnote-1">1</a></sup>. During that time, other API requests served in the interim allocated the majority of that memory. It <em>didn&rsquo;t</em> crash our 512 MB dyno because multiple GCs also occurred during that time.</p>

<h2 id="pprof-to-s3" class="link"><a href="#pprof-to-s3">Pprof to S3</a></h2>

<p>Getting our memory overruns localized to a particular endpoint was good, but even having done that, I&rsquo;d need a little more help to figure out where the rogue memory was going. To that end, I put in one more clause in the middleware so that in case of a huge overrun, the process dumps a <a href="https://github.com/google/pprof">pprof</a> heap profile to S3:</p>

<pre><code class="language-go">    ...

    // If we used a particularly huge amount of memory during the
    // request, upload a profile to S3 for analysis. Buckets have a
    // configured life cycle so objects will expire out after some
    // time.
    if err := m.maybeUploadPprof(ctx, logData.RequestID, totalAllocDelta); err != nil {
        plog.Logger(ctx).Errorf(m.Name+&quot;: Error uploading pprof profile: %s&quot;, err)
    }

...

const pprofTotalAllocDeltaThreshold = 1_000_000_000

func (m *CanonicalAPILineMiddleware) maybeUploadPprof(ctx context.Context, requestID uuid.UUID, totalAllocDelta uint64) error {
    if !m.pprofEnable || totalAllocDelta &lt; m.pprofTotalAllocDeltaThreshold {
        return nil
    }

    profKey := fmt.Sprintf(&quot;%s/pprof/%s.prof&quot;, m.EnvName, requestID)

    var buf bytes.Buffer
    if err := pprof.WriteHeapProfile(&amp;buf); err != nil {
        return xerrors.Errorf(&quot;error writing heap profile: %w&quot;, err)
    }

    if _, err := m.aws.S3_PutObject(ctx, &amp;s3.PutObjectInput{
        Body:   &amp;buf,
        Bucket: ptrutil.Ptr(awsclient.S3Bucket),
        Key:    &amp;profKey,
    }); err != nil {
        return xerrors.Errorf(&quot;error putting heap profile to S3 at path %q: %w&quot;, profKey, err)
    }

    plog.Logger(ctx).Infof(m.Name+&quot;: pprof_profile_generated_line: TotalAlloc delta %d exceeded %d; generated pprof profile to S3 key %q&quot;,
        totalAllocDelta, m.pprofTotalAllocDeltaThreshold, profKey)

    return nil
}
</code></pre>

<p>Our memory problem ended up being a queries endpoint that was overly willing to read giant result sets into memory, then serialize the whole thing into a big JSON buffer for response, which was also pretty indented (and in Go&rsquo;s <code>encoding/json</code>, indenting a JSON response requires a <em>second</em> giant buffer 2x the size of the first one). I fixed it by reducing the maximum number of rows we were willing to read into the response.</p>

<p>I&rsquo;m not expecting to run into new memory overruns or leaks anytime soon, but I left the pprof code in place for the time being. It only does work in case of huge memory increases so there&rsquo;s no performance penalty most of the time, and it might come in handy again.</p>

<h2 id="stop-the-world" class="link"><a href="#stop-the-world">Stop the world</a></h2>

<p>A token glance at the implementation of <code>runtime.ReadMemStats</code> looks a little concerning:</p>

<pre><code class="language-go">// ReadMemStats populates m with memory allocator statistics.
//
// The returned memory allocator statistics are up to date as of the
// call to ReadMemStats. This is in contrast with a heap profile,
// which is a snapshot as of the most recently completed garbage
// collection cycle.
func ReadMemStats(m *MemStats) {
    _ = m.Alloc // nil check test before we switch stacks, see issue 61158
    stw := stopTheWorld(stwReadMemStats)

    systemstack(func() {
        readmemstats_m(m)
    })

    startTheWorld(stw)
}
</code></pre>

<p>To produce accurate stats, the runtime needs to &ldquo;stop the world&rdquo;, meaning that all active goroutines are paused, a sample taken, and resumed.</p>

<p>Intuitively, that seems like it could be pretty slow, and some initial googling seemed to confirm that. However, I later found a <a href="https://go-review.googlesource.com/c/go/+/34937">patch from 2017</a> that&rsquo;d improved the situation considerably by doing cumulative tracking of relevant stats so only a very brief stop the world was required. It indicated a reduction in timing down to 25µs, even at 100 concurrent goroutines.</p>

<p>I added a separate log stat to see how long my two <code>ReadMemStats</code> calls were taking, and found they were averaging ~100µs for both:</p>

<pre><code class="language-txt">read_mem_stats_duration=0.000098s
read_mem_stats_duration=0.000110s
read_mem_stats_duration=0.000113s
read_mem_stats_duration=0.000126s
read_mem_stats_duration=0.000123s
read_mem_stats_duration=0.000084s
read_mem_stats_duration=0.000091s
read_mem_stats_duration=0.000092s
read_mem_stats_duration=0.000090s
read_mem_stats_duration=0.000083s
</code></pre>

<p>That&rsquo;s 50µs per invocation instead of 25µs, but given that a single DB query takes an order of magnitude or two longer at 1-10ms, a little delay to get memory stats is acceptable. If our stack were hyper performance sensitive or saturated with huge request volume, I&rsquo;d take it out.</p>


]]></content>
    <published>2025-02-02T18:02:27-07:00</published>
    <updated>2025-02-02T18:02:27-07:00</updated>
    <link href="https://brandur.org/fragments/profiling-production"></link>
    <id>tag:brandur.org,2025-02-02:fragments/profiling-production</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Go&#39;s bytes.Buffer vs. strings.Builder</title>
    <summary>Taking five minutes to write a benchmark so I know which of these I should be reaching for first.</summary>
    <content type="html"><![CDATA[<p>I was writing some Go code today that generated other Go code. Writing it line by line, mostly in a loop, but with pre- and post-matter.</p>

<p>My usual go-to for this type of thing is <a href="https://pkg.go.dev/bytes#Buffer"><code>bytes.Buffer</code></a>, but after I&rsquo;d finished the implementation, given that I was working entirely with strings, I started to wonder if I should&rsquo;ve used <a href="https://pkg.go.dev/strings#Builder"><code>strings.Builder</code></a> instead.</p>

<p>I realized that I had no idea whether one was faster than the other, so I wrote a quick benchmark to check:</p>

<pre><code class="language-go">package main

import (
    &quot;bytes&quot;
    &quot;strings&quot;
    &quot;testing&quot;
)

var fragments = []string{
    &quot;This&quot;,
    &quot;is a series of&quot;,
    &quot;string fragments&quot;,
    &quot;that will be concatenated together&quot;,
    &quot;into a single larger string&quot;,
    &quot;so that we can&quot;,
    &quot;determine which of Go's various&quot;,
    &quot;tools for doing this&quot;,
    &quot;is most efficient.&quot;,
    &quot;I found a few articles&quot;,
    &quot;online&quot;,
    &quot;but most were poorly cited&quot;,
    &quot;or&quot;,
    &quot;behind a Medium login wall&quot;,
    &quot;or otherwise&quot;,
    &quot;not of admirable quality.&quot;,
}

func BenchmarkBytesBuffer(b *testing.B) {
    for range b.N {
        var buf bytes.Buffer

        for _, fragment := range fragments {
            buf.WriteString(fragment)
            buf.WriteString(&quot; &quot;)
        }

        _ = buf.String()
    }
}

func BenchmarkConcatenateStrings(b *testing.B) {
    for range b.N {
        var str string

        for _, fragment := range fragments {
            str += fragment
            str += &quot; &quot;
        }
    }
}

func BenchmarkStringBuilder(b *testing.B) {
    for range b.N {
        var sb strings.Builder

        for _, fragment := range fragments {
            sb.WriteString(fragment)
            sb.WriteString(&quot; &quot;)
        }

        _ = sb.String()
    }
}
</code></pre>

<pre><code class="language-sh">$ go test -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: github.com/brandur/go-builder-vs-buffer
cpu: Apple M4
BenchmarkBytesBuffer-10           5013081    217.3 ns/op    1280 B/op    5 allocs/op
BenchmarkConcatenateStrings-10    1603748    753.5 ns/op    5557 B/op    31 allocs/op
BenchmarkStringBuilder-10         6916813    146.9 ns/op    752 B/op     6 allocs/op
PASS
ok      github.com/brandur/go-builder-vs-buffer 4.724s

</code></pre>

<p>So there you have it. At least when it comes to concatenating only strings at relatively modest sizes, <code>strings.Builder</code> is about 33% faster than <code>bytes.Buffer</code>, and 80% faster <sup id="footnote-1-source"><a href="#footnote-1">1</a></sup> than concatenating strings. Given that the DX is identical between the two, I&rsquo;ll make it my new default go-to.</p>


]]></content>
    <published>2025-01-02T22:34:32-07:00</published>
    <updated>2025-01-02T22:34:32-07:00</updated>
    <link href="https://brandur.org/fragments/bytes-buffer-vs-strings-builder"></link>
    <id>tag:brandur.org,2025-01-02:fragments/bytes-buffer-vs-strings-builder</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Postgres UUIDv7 + per-backend monotonicity</title>
    <summary>How Postgres&amp;rsquo; v7 UUIDs are made monotonic, and why that&amp;rsquo;s a great feature.</summary>
    <content type="html"><![CDATA[<p>An implementation for <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=78c5e141e9c139fc2ff36a220334e4aa25e1b0eb">UUIDv7 was committed to Postgres</a> earlier this month. These have all the benefits of a v4 (random) UUID, but are generated with a more deterministic order using the current time, and perform considerably better on inserts using ordered structures like B-trees.</p>

<p>A nice surprise is that the random portion of the UUIDs will be monotonic within each Postgres backend:</p>

<blockquote>
<p>In our implementation, the 12-bit sub-millisecond timestamp fraction
is stored immediately after the timestamp, in the space referred to as
&ldquo;rand_a&rdquo; in the RFC. This ensures additional monotonicity within a
millisecond. The rand_a bits also function as a counter. We select a
sub-millisecond timestamp so that it monotonically increases for
generated UUIDs within the same backend, even when the system clock
goes backward or when generating UUIDs at very high
frequency. Therefore, the monotonicity of generated UUIDs is ensured
within the same backend.</p>
</blockquote>

<p>This is a hugely valuable feature in practice, especially in testing. Say you want to generate five objects for testing an API list endpoint. It&rsquo;s possible they&rsquo;re generated in-order by virtue of being across different milliseconds or by getting lucky, but probability is against you, and the likelihood is that some will be out of order. A test case has to generate the five objects, then do an initial sort before making use of them. That&rsquo;s not the end of the world, but it&rsquo;s more test code and adds noise.</p>

<pre><code class="language-ruby">test_accounts = 5.times.map { TestFactory.account }

# maybe IDs were in order, but maybe not, so do an initial sort
test_accounts.sort_by! { |a| a.id }

# API endpoint will return accounts ordered by ID
resp = make_api_request :get, &quot;/accounts&quot;
expect(resp.map { _1[&quot;id&quot;] }).to eq(test_accounts.map(&amp;:id))
</code></pre>

<p>With Postgres ensuring monotonicity for UUIDv7s, the five generated objects get five in-order IDs, making the test safer <sup id="footnote-1-source"><a href="#footnote-1">1</a></sup> and faster to write. Monotonicity isn&rsquo;t guaranteed across backends, but that&rsquo;s okay in well-written test suites. Patterns like <a href="/fragments/go-test-tx-using-t-cleanup">test transactions</a> will guarantee that each test case speaks to exactly one backend.</p>

<h2 id="12-bits-more-clock" class="link"><a href="#12-bits-more-clock">12 bits more clock</a></h2>

<p>My grasp on monotonicity has always been tenuous at best, so I was curious how it was implemented here. I looked at the patch, and its approach was more obvious than I expected:</p>

<pre><code class="language-c">/*
 * Generate UUID version 7 per RFC 9562, with the given timestamp.
 *
 * UUID version 7 consists of a Unix timestamp in milliseconds (48
 * bits) and 74 random bits, excluding the required version and
 * variant bits. To ensure monotonicity in scenarios of high-
 * frequency UUID generation, we employ the method &quot;Replace
 * LeftmostRandom Bits with Increased Clock Precision (Method 3)&quot;,
 * described in the RFC. This method utilizes 12 bits from the
 * &quot;rand_a&quot; bits to store a 1/4096 (or 2^12) fraction of sub-
 * millisecond precision.
 *
 * ns is a number of nanoseconds since start of the UNIX epoch.
 * This value is used for time-dependent bits of UUID.
 */
static pg_uuid_t* generate_uuidv7(int64 ns) {

...

/*
 * sub-millisecond timestamp fraction (SUBMS_BITS bits, not
 * SUBMS_MINIMAL_STEP_BITS)
 */
increased_clock_precision = ((ns % NS_PER_MS) * (1 &lt;&lt; SUBMS_BITS)) / NS_PER_MS;

/* Fill the increased clock precision to &quot;rand_a&quot; bits */
uuid-&gt;data[6] = (unsigned char) (increased_clock_precision &gt;&gt; 8);
uuid-&gt;data[7] = (unsigned char) (increased_clock_precision);

/* fill everything after the increased clock precision with random bytes */
if (!pg_strong_random(&amp;uuid-&gt;data[8], UUID_LEN - 8))
    ereport(ERROR,
            (errcode(ERRCODE_INTERNAL_ERROR),
            errmsg(&quot;could not generate random values&quot;)));
</code></pre>

<p>UUIDv7 dictates an initial 48 bits that encode a timestamp down to millisecond precision. A millisecond is a short amount of time for a human, but quite long for a computer, and many UUIDs could easily be generated within the space of a single ms.</p>

<pre><code> 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      48 bits unix_ts_ms                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   48 bits unix_ts_ms (cont)   |  ver  |    12 bits rand_a     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|                    62 bits rand_b                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     62 bits rand_b (cont)                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
</code></pre>

<p>The Postgres patch solves the problem by repurposing 12 bits of the UUID&rsquo;s random component to increase the precision of the timestamp down to sub-millisecond granularity, 1/4096th of a millisecond (filling <code>rand_a</code> above), which in practice is too precise to contain two UUIDv7s generated in the same process. It makes a repeated UUID <em>between</em> processes more likely, but there&rsquo;s still 62 bits of randomness left to make use of, so collisions remain vanishingly unlikely.</p>
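<p>To make the arithmetic concrete, here&rsquo;s a rough Go translation of that fraction calculation (a standalone sketch, not the actual Postgres source): the position within the current millisecond gets scaled into a 12-bit value from 0 to 4095.</p>

<pre><code class="language-go">package main

import (
    &quot;fmt&quot;
    &quot;time&quot;
)

// subMSFraction mirrors the patch's rand_a computation: take the
// sub-millisecond remainder of a nanosecond timestamp and scale it
// into 12 bits (2^12 = 4096 possible values).
func subMSFraction(ns int64) uint16 {
    const nsPerMS = 1_000_000
    return uint16(((ns % nsPerMS) * 4096) / nsPerMS)
}

func main() {
    fmt.Println(subMSFraction(0))       // 0
    fmt.Println(subMSFraction(500_000)) // 2048
    fmt.Println(subMSFraction(time.Now().UnixNano()))
}
</code></pre>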

<h2 id="wait" class="link"><a href="#wait">The wait is on</a></h2>

<p>UUIDv7s are going to make a great core addition to Postgres, and I can&rsquo;t wait to start using them. Quite unfortunately, their commit was delayed past the freeze for Postgres 17, so they won&rsquo;t make it into an official version until Postgres 18 is cut in late 2025. So now, we wait.</p>


]]></content>
    <published>2024-12-31T15:32:43-07:00</published>
    <updated>2024-12-31T15:32:43-07:00</updated>
    <link href="https://brandur.org/fragments/uuid-v7-monotonicity"></link>
    <id>tag:brandur.org,2024-12-31:fragments/uuid-v7-monotonicity</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Stripe V2</title>
    <summary>Stripe&amp;rsquo;s API got a new major version, and no one noticed.</summary>
    <content type="html"><![CDATA[<p>I happened to notice by way of a Slack bot today that Stripe released a <a href="https://docs.stripe.com/api-v2-overview">V2 version of their API</a>. I thought this must&rsquo;ve been a soft launch right before the holidays, surely to be followed up by a more formal blog post, but the Wayback Machine clocked the page in <a href="https://web.archive.org/web/20241004013621/https://docs.stripe.com/api-v2-overview">early October</a>, making it three months old. It&rsquo;s been there all along, I just hadn&rsquo;t seen it before.</p>

<p>The V1 and V2 APIs are separate namespaces and what&rsquo;s available in V2 is currently very minimal (only events and event destinations), so integrations will still use V1 for almost everything, but the overview page tells us about its aspirational design intentions.</p>

<h2 id="json-hateoas" class="link"><a href="#json-hateoas">JSON, with a sprinkling of HATEOAS</a></h2>

<p>A few highlights:</p>

<ul>
<li><p>By far the best and biggest change is that request bodies are sent as JSON instead of <code>application/x-www-form-urlencoded</code>. Form encoding isn&rsquo;t the worst thing in the world, but it falls flat on its face when encoding complex data types like arrays and maps (or worse, <em>nested</em> arrays and maps). It&rsquo;s also just weird and out of place in 2024. This change should&rsquo;ve happened ten years ago.</p></li>

<li><p>Pagination has picked up a hypermedia-esque veneer (see <a href="https://en.wikipedia.org/wiki/HATEOAS">HATEOAS</a>), returning a <code>next_page_url</code> that&rsquo;s requested directly, instead of returning a cursor and having the caller build the next URL themselves.</p></li>

<li><p>The new API is trying to move away from a model where sub-objects in an API resource are expanded by default, to one where they need to be requested with an <code>include</code> parameter. We had plenty of discussions about this before I left. The purpose of the change is to make API requests faster (Stripe&rsquo;s API is quite slow) by rendering less for most requests. I counted only two places where this is actually used so far though, so time will tell whether the gambit actually succeeds or not.</p></li>

<li><p>Endpoints will try for &ldquo;real&rdquo; idempotency where callers can converge failed operations to either success or definitive failure by calling them again:</p>

<blockquote>
<ul>
<li>When you provide the same idempotency key for two requests:

<ul>
<li>API v1 always returns the previously-saved response of the first API request, even if it was an error.</li>
<li>API v2 attempts to retry any failed requests without producing side effects (any extraneous change or observable behavior that occurs as a result of an API call) and provide an updated response.</li>
</ul></li>
</ul>
</blockquote>

<p>Previously (and still for most endpoints), failures from an intermittent blip or bug were a big problem. The idempotency layer dumbly returned whatever canned response had been recorded on the initial go around (including internal server errors), so users wouldn&rsquo;t get closure on what exactly happened. Their best hope would be that a Stripe engineer would eventually repair their charge manually at some later time, and send a webhook about it.</p></li>
</ul>

<h2 id="rest-ish" class="link"><a href="#rest-ish">REST-ish v4-ever</a></h2>

<p>Lots of positive progress there, but a new API version also presents an opportunity to clear out blemishes, and I expected to see more of that. A few points that are less good:</p>

<ul>
<li><p>I was hoping they&rsquo;d fix their verbs to play more nicely with modern REST conventions. Instead of using <code>POST</code> everywhere, use <code>POST</code> for endpoints that are knowingly not idempotent (without an idempotency key), <code>PUT</code> for mutation endpoints that are, and <code>PATCH</code> for mutation endpoints that aren&rsquo;t. I admit it&rsquo;s pedantic, but it&rsquo;s so absolutely trivial to implement, and the use of a good verb signals more information than a reader would otherwise have with a cursory glance at API structure.</p></li>

<li><p>They&rsquo;re still doing the RPC-style calls like:</p>

<pre><code>POST /v2/core/event_destinations/:id/enable
</code></pre>
<p>Also pedantic, but <code>enable</code> here should theoretically be reserved for a nested resource. I think it&rsquo;s cleaner to model actions as IDs under a shared &ldquo;actions&rdquo; subresource:</p>

<pre><code>POST /v2/core/event_destinations/:id/actions/enable
</code></pre></li>
</ul>

<h2 id="nouveau-dx" class="link"><a href="#nouveau-dx">Nouveau DX</a></h2>

<p>Frankly, I was a bit shocked by how little attention this got. There was a time not too long ago when Stripe cutting a new API version would&rsquo;ve been a major event in the tech world, but in three months I didn&rsquo;t come across a single person who mentioned it.</p>

<p>A major part of this is that Stripe is no longer a great technical leader in the same sense that it used to be. But also, as <a href="https://x.com/tweetsbycolin/status/1873241754784411656">Colin points out</a>:</p>

<blockquote>
<p>This is an undeniable sign that &ldquo;a great REST API&rdquo; is no longer the benchmark for great DX</p>
</blockquote>

<p>That&rsquo;s got to be true too. Few of us want to be making manual HTTP calls out to APIs anymore. These days a great SDK, not a great API, is a hallmark, and maybe even a necessity, of a world class development experience.</p>
]]></content>
    <published>2024-12-28T23:56:24-07:00</published>
    <updated>2024-12-28T23:56:24-07:00</updated>
    <link href="https://brandur.org/fragments/stripe-v2"></link>
    <id>tag:brandur.org,2024-12-28:fragments/stripe-v2</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Go&#39;s maximum time.Duration</title>
    <summary>Avoiding overflows with Go&amp;rsquo;s &lt;code&gt;time.Duration&lt;/code&gt; in the presence of exponential algorithms.</summary>
    <content type="html"><![CDATA[<p>While working on a River bug related to retry policy, I came across a case where it was actually plausible to overflow Go&rsquo;s built-in <code>time.Duration</code> and wrap back around to a negative number.</p>

<p>A duration has a much simpler representation than a timestamp. It&rsquo;s an <code>int64</code> counted in nanoseconds:</p>

<pre><code class="language-go">// A Duration represents the elapsed time between two instants
// as an int64 nanosecond count. The representation limits the
// largest representable duration to approximately 290 years.
type Duration int64
</code></pre>

<p>As the comment states, the maximum duration is about 290 years. More precisely, 292 (non-leap) years, 171 days, and 23 hours:</p>

<pre><code class="language-go">func main() {
    const (
        maxDuration time.Duration = 1&lt;&lt;63 - 1

        day  = 24 * time.Hour
        year = 365 * day
    )

    var (
        years        = maxDuration / year
        withoutYears = maxDuration % year

        days        = withoutYears / day
        withoutDays = withoutYears % day
    )

    fmt.Printf(&quot;max duration: %dy%dd%s\n&quot;, years, days, withoutDays)
}
</code></pre>

<pre><code class="language-sh">$ go run main.go
max duration: 292y171d23h47m16.854775807s
</code></pre>

<p>292 years is a long time, and it&rsquo;s not likely most programs will need more than that, but our retry algorithm is exponential, and crosses that threshold after 310 retries.</p>
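<p>That threshold is easy to sanity check with a few lines of standalone Go (a sketch, not River&rsquo;s actual code): the maximum duration holds about 9.22 billion seconds, and the backoff is the fourth power of the attempt number, in seconds.</p>

<pre><code class="language-go">package main

import (
    &quot;fmt&quot;
    &quot;math&quot;
    &quot;time&quot;
)

func main() {
    const maxDuration time.Duration = math.MaxInt64
    maxSeconds := maxDuration.Seconds() // about 9.22 billion seconds

    // Find the first attempt whose fourth-power backoff no longer
    // fits in a time.Duration.
    attempt := 1
    for math.Pow(float64(attempt), 4) &lt;= maxSeconds {
        attempt++
    }
    fmt.Println(attempt) // 310
}
</code></pre>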

<h2 id="compile-v-runtime-overflow" class="link"><a href="#compile-v-runtime-overflow">Compile v. runtime overflow</a></h2>

<p>When performing a direct calculation on a constant, the compiler will detect the overflow:</p>

<pre><code class="language-go">func main() {
    const maxDuration time.Duration = 1&lt;&lt;63 - 1
    var maxDurationSeconds = float64(maxDuration / time.Second)

    notOverflowed := time.Duration(maxDurationSeconds) * time.Second
    fmt.Printf(&quot;not overflowed: %+v\n&quot;, notOverflowed)

    overflowed := time.Duration(int64(maxDuration)+1) * time.Second
    fmt.Printf(&quot;overflowed: %+v\n&quot;, overflowed)
}
</code></pre>

<pre><code class="language-sh">$ go run main.go
./main.go:15:30: int64(maxDuration) + 1 (constant 9223372036854775808 of type int64) overflows int64
</code></pre>

<p>But performing the same operation on a variable will happily wrap around:</p>

<pre><code class="language-go">overflowed := time.Duration(maxDurationSeconds+1) * time.Second
fmt.Printf(&quot;overflowed: %+v\n&quot;, overflowed)
</code></pre>

<pre><code class="language-sh">$ go run main.go
not overflowed: 2562047h47m16s
overflowed: -2562047h47m16.709551616s
</code></pre>

<h2 id="well-defined" class="link"><a href="#well-defined">Little practical use, but well defined</a></h2>

<p>I <a href="https://github.com/riverqueue/river/pull/698">fixed River&rsquo;s back offs at large attempt counts</a> by using Go 1.21&rsquo;s <code>min</code> function combined with the maximum known number of seconds that&rsquo;ll fit in a <code>time.Duration</code>:</p>

<pre><code class="language-go">// The maximum value of a duration before it overflows. About 292 years.
const maxDuration time.Duration = 1&lt;&lt;63 - 1

// Same as the above, but changed to a float represented in seconds.
var maxDurationSeconds = maxDuration.Seconds()

func (p *DefaultClientRetryPolicy) NextRetry(job *rivertype.JobRow) time.Time {
    return time.Now().Add(timeutil.SecondsAsDuration(
        p.retrySeconds(len(job.Errors) + 1),
    ))
}

func (p *DefaultClientRetryPolicy) retrySeconds(attempt int) float64 {
    retrySeconds := math.Pow(float64(attempt), 4)
    return min(retrySeconds, maxDurationSeconds)
}
</code></pre>

<p>After hitting retry attempt 310, the algorithm backs off 292 years at a time. This behavior will never be of any real use to anybody, but I changed it to be <em>well defined</em> behavior of no real use to anybody, with no risk of odd bugs that might otherwise result from an overflow.</p>
]]></content>
    <published>2024-12-21T10:30:01-07:00</published>
    <updated>2024-12-21T10:30:01-07:00</updated>
    <link href="https://brandur.org/fragments/go-max-time-duration"></link>
    <id>tag:brandur.org,2024-12-21:fragments/go-max-time-duration</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>ERROR: invalid byte sequence for encoding UTF8: 0x00 (and what to do about it)</title>
    <summary>Handling a common programming language/database asymmetry around tolerance of zero bytes.</summary>
    <content type="html"><![CDATA[<p>One of the oldest errors I ever remember seeing in an error tracker:</p>

<blockquote>
<p>ERROR: invalid byte sequence for encoding &ldquo;UTF8&rdquo;: <code>0x00</code></p>
</blockquote>

<p>Through my time at Heroku it was like a distant friend. Not one that you&rsquo;d see every day, but one who&rsquo;d appear to surprise you a few dozen times a year. Since it didn&rsquo;t seem to be causing any major fallout and I never heard a user complain about it, I&rsquo;m somewhat embarrassed to say that in four years neither I nor anyone else ever bothered to look into it.</p>

<p>These days, on a Go stack and with much better control and insight into any changes we make, we&rsquo;re pretty aggressive about trying to prune Sentry errors down to zero. Over a few months I&rsquo;d see the <code>0x00</code> error come and go, and finally decided to look into it.</p>

<p>The problem comes from Postgres raising an error when a caller tries to insert a text/varchar value containing <code>0x00</code>, a zero byte: the same value that&rsquo;s used to terminate a string in plain old C. Postgres <a href="https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-ESCAPE">explicitly disallows it</a>:</p>

<blockquote>
<p>The character with the code zero cannot be in a string constant.</p>
</blockquote>

<p>The tricky part is that although Postgres won&rsquo;t take a zero byte, almost every programming language ever created <em>will</em>, thereby creating a natural asymmetry between database and language stack.</p>
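<p>Go is a good example. A zero byte is an ordinary byte in a Go string, and it even passes UTF-8 validation, which is exactly why it slips through naive checks (a quick standalone demonstration):</p>

<pre><code class="language-go">package main

import (
    &quot;fmt&quot;
    &quot;unicode/utf8&quot;
)

func main() {
    // A zero byte is a perfectly legal byte in a Go string, and
    // it's also valid UTF-8, so a plain validity check won't catch
    // it. Postgres will still reject the value on insert.
    s := &quot;foo\x00bar&quot;
    fmt.Println(len(s))              // 7
    fmt.Println(utf8.ValidString(s)) // true
}
</code></pre>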

<p>As far as I know, there aren&rsquo;t any legitimate uses for sending a zero byte to an API or web app. Looking back through our logs, the main places I&rsquo;ve seen it are from bots out on the internet, presumably using common attack patterns to probe for weaknesses, or from pentest teams that we paid to do the same.</p>

<h2 id="edges" class="link"><a href="#edges">Validating at the edges</a></h2>

<p>We&rsquo;re using the <a href="https://github.com/go-playground/validator">validate framework for Go</a> to check that API inputs are sound, like that they&rsquo;re present, below a max length, or within bounds. In a language known for its verbosity, validate annotations are succinct and quick to write.</p>

<p>The custom validations <code>apistring200</code>, <code>apistring2000</code>, <code>apistring20000</code>, etc. are assigned to API string parameters in <a href="/text#varchars">order of magnitude tiers</a>. Their implementation denies <code>\x00</code>s that come in with request payloads:</p>

<pre><code class="language-go">// API strings are meant to provide a reasonable default validation
// for strings that come in via the API that aren't already
// validated more strictly. The main idea is to make sure that
// we're not getting long, unbounded input that'll either store a
// very invalid value to the database or be rejected by a DB-level
// constraint (which would bubble up as a 500 with little context).
//
// They also validate that strings contain no invalid unicode
// sequences, and that no `\x00` zero bytes are present, both of
// which Postgres will reject.
must(registerAPIString(&quot;apistring200&quot;, 200))
must(registerAPIString(&quot;apistring2000&quot;, 2_000))
must(registerAPIString(&quot;apistring20000&quot;, 20_000))
must(registerAPIString(&quot;apistring200000&quot;, 200_000))

const (
    apiStringErrorMessage = &quot;`{0}` should be a non-empty string with a maximum length of %d characters, and contain no invalid unicode sequences or zero bytes&quot;
)

func registerAPIString(tag string, maxLength int) error {
    if err := validate.RegisterValidation(tag, func(fl validator.FieldLevel) bool {
        val := fl.Field().String()

        if len(val) == 0 || len(val) &gt; maxLength {
            return false
        }

        if !utf8.ValidString(val) {
            return false
        }

        // A zero (0x00) rune is valid UTF-8 and won't be caught
        // by the unicode check above, but Postgres will refuse
        // to insert it.
        if strings.Contains(val, &quot;\x00&quot;) {
            return false
        }

        return true
    }); err != nil {
        return err
    }

    return registerTranslation(tag, fmt.Sprintf(apiStringErrorMessage, maxLength))
}
</code></pre>

<p>Notably, it also denies invalid UTF-8 byte sequences (<code>\x00</code> is not desirable, but it is valid UTF-8), another common malformed input that internet bots like to send, and which will cause its own Postgres error.</p>

<p>Struct fields are tagged with validations, making use easy and concise:</p>

<pre><code class="language-go">// Request for creating a new account.
type AccountCreateRequest struct {
    // Full name for the new account.
    Name *string `json:&quot;name&quot; validate:&quot;apistring200&quot;`
    
    ...
</code></pre>

<h2 id="raw-request-properties" class="link"><a href="#raw-request-properties">Storing raw request properties</a></h2>

<p>That takes care of input forms, but another place we&rsquo;d see the problem is when trying to insert <a href="/canonical-log-lines">canonical API lines</a> to the database for operational visibility. Even when we deny a request with invalid input with a 400, we record a canonical line for it, invalid input and all.</p>

<p>For this case, we take anything invalid in the input and replace it with a placeholder token that&rsquo;s safely storable to Postgres:</p>

<pre><code class="language-go">// TrimInvalidUTF8 replaces any invalid UTF-8 or \x00 bytes with
// symbolic stand-in tokens. This lets strings that contain invalid
// UTF-8 be stored to Postgres, which normally won't tolerate
// invalid UTF-8 in string-like fields.
func TrimInvalidUTF8(s string) string {
    if !utf8.ValidString(s) {
        s = strings.ToValidUTF8(s, &quot;[invalid UTF-8]&quot;)
    }

    // A zero (0x00) rune is valid UTF-8 and won't be caught by the
    // check above, but Postgres will refuse to insert it. Replace
    // all instances with a marker that Postgres can tolerate and
    // which is indicative of what happened. This should only ever
    // happen because of random probing from malicious internet
    // actors sending garbage into HTTP paths and what not.
    if strings.Contains(s, &quot;\x00&quot;) {
        s = strings.ReplaceAll(s, &quot;\x00&quot;, &quot;[0x00 UTF-8 rune]&quot;)
    }

    return s
}
</code></pre>
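<p>The two standard library calls doing the heavy lifting behave like this (a standalone demonstration with a made-up input):</p>

<pre><code class="language-go">package main

import (
    &quot;fmt&quot;
    &quot;strings&quot;
)

func main() {
    // \xff is an invalid UTF-8 byte, so it's replaced by
    // ToValidUTF8; \x00 is valid UTF-8 but rejected by Postgres,
    // so it gets its own marker via ReplaceAll.
    s := &quot;caf\xffe\x00!&quot;
    s = strings.ToValidUTF8(s, &quot;[invalid UTF-8]&quot;)
    s = strings.ReplaceAll(s, &quot;\x00&quot;, &quot;[0x00 UTF-8 rune]&quot;)
    fmt.Println(s) // caf[invalid UTF-8]e[0x00 UTF-8 rune]!
}
</code></pre>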

<p>This is combined with another helper that samples down inputs longer than we&rsquo;re willing to store:</p>

<pre><code class="language-go">// Returns a string that's been truncated to the given max length
// and stripped of any invalid UTF-8 that Postgres might balk at.
// Returns an empty string on `nil`, since the batch insert treats
// empty strings as NULL.
validTruncatedStringOrEmpty := func(sPtr *string, maxLength int) string {
    if sPtr == nil {
        return &quot;&quot;
    }

    return stringutil.SampleLongN(stringutil.TrimInvalidUTF8(*sPtr), maxLength)
}
</code></pre>

<p>When inserting a canonical line for a request, inputs are sanitized and truncated. This happens for obvious fields where an invalid input can be sent like a query string or form body, but for less obvious ones as well. Invalid input can come in almost anywhere, including headers like <code>Content-Type</code> or <code>User-Agent</code>:</p>

<pre><code class="language-go">insertParams.ContentType[i] =
    validTruncatedStringOrEmpty(logData.ContentType, 200)
insertParams.HTTPPath[i] =
    validTruncatedStringOrEmpty(&amp;logData.HTTPPath, 200)
insertParams.QueryString[i] =
    validTruncatedStringOrEmpty(logData.QueryString, 2000)
insertParams.UserAgent[i] =
    validTruncatedStringOrEmpty(logData.UserAgent, 200)
</code></pre>

<h2 id="one-down" class="link"><a href="#one-down">0x01 down</a></h2>

<p>This is one of those little housekeeping tasks that may not be that important, but is quite gratifying. With the steps above we&rsquo;ve eradicated &ldquo;invalid byte sequence&rdquo; errors, taking us a step closer to our target steady state of zero Sentry issues.</p>
]]></content>
    <published>2024-12-19T14:58:05-07:00</published>
    <updated>2024-12-19T14:58:05-07:00</updated>
    <link href="https://brandur.org/fragments/invalid-byte-sequence"></link>
    <id>tag:brandur.org,2024-12-19:fragments/invalid-byte-sequence</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>The parallel test bundle, a convention for Go testing</title>
    <summary>A Go convention that we&amp;rsquo;ve found effective for making subtests parallel-safe, keeping them DRY, and keeping code readable.</summary>
    <content type="html"><![CDATA[<p>A year ago we went through the process of getting every test case in our project tagged with <a href="/t-parallel"><code>t.Parallel</code> and ratcheted with <code>paralleltest</code></a>. I was initially skeptical about this being worth the effort because testing across Go packages was already happening in parallel, but it turned out to be a major boon for running large packages individually where we reduced test time by 30%+. We took one more step from there to tag every <em>subtest</em> with <code>t.Parallel</code> too. The gains from that weren&rsquo;t as big, but it helps when running tests with many subtests one-off, and isn&rsquo;t much effort to sustain now that it&rsquo;s in place.</p>

<p>We&rsquo;re running close to 5,000 tests at this point. Large scale code refactoring tools aren&rsquo;t widespread in Go, so I did most of the refactoring with some <em>very</em> gnarly multi-line regexes, and even with those, the only reason that it was possible was that we&rsquo;re obsessive with keeping strong code convention. Most test cases were structured with an identical layout, which might&rsquo;ve seemed like unnecessary pedantry when it was first going in, but later paid off in reams as I refactored thousands of tests in hours instead of weeks.</p>

<p>Let me showcase a test convention that we&rsquo;ve found to be useful for making subtests parallel-safe, keeping them DRY (unlike many languages, Go doesn&rsquo;t have built-in facilities for setup/teardown blocks in tests), and keeping code readable. I try to be honest in the assessment of programming conventions and am not always certain about new ones, but we&rsquo;ve been using the parallel test bundle for months and I&rsquo;d rate it a <sup>10</sup>&frasl;<sub>10</sub> strong recommendation. Better yet, it&rsquo;s all just plain Go code and doesn&rsquo;t require the adoption of anything weird/novel.</p>

<h2 id="bundle-struct" class="link"><a href="#bundle-struct">The test bundle struct</a></h2>

<p>The test bundle itself is a simple struct containing the object under test and useful fixtures to have available across subtests:</p>

<pre><code class="language-go">type testBundle struct {
    account *dbsqlc.Account
    svc     *playgroundTutorialService
    team    *dbsqlc.Team
    tx      db.Tx
}
</code></pre>

<h2 id="setup-function" class="link"><a href="#setup-function">The setup function</a></h2>

<p>It&rsquo;s paired with a <code>setup</code> helper function that returns a bundle:</p>

<pre><code class="language-go">setup := func(t *testing.T) (*testBundle, context.Context) {
    t.Helper()

    // These two vars are standard across almost every test case.
    var (
        ctx = ptesting.Context(t)
        tx  = ptesting.TestTx(ctx, t)
    )

    // Group of data fixtures.
    var (
        team    = dbfactory.Team(ctx, t, tx, &amp;dbfactory.TeamOpts{})
        account = dbfactory.Account(ctx, t, tx, &amp;dbfactory.AccountOpts{})
        _       = dbfactory.AccessGroupAccount_Admin(ctx, t, tx, team.ID, account.ID)
    )
    ctx = authntest.Account(account).Context(ctx)

    return &amp;testBundle{
        account: account,
        svc:     pservicetest.InitAndStart(ctx, t, NewPlaygroundTutorialService(), tx.Begin, nil),
        team:    team,
        tx:      tx,
    }, ctx
}
</code></pre>

<p>Along with a test bundle, the function also returns a context <sup id="footnote-1-source"><a href="#footnote-1">1</a></sup>, which is useful for seeding context with a context logger that makes sure all <a href="/t-parallel#logging">logging output is collated with the test</a> being run, instead of going to <code>stdout</code> where it would be interleaved with that of other tests running in parallel. Tests that don&rsquo;t need a context omit the second return value.</p>

<h2 id="subtests" class="link"><a href="#subtests">Subtest invocations</a></h2>

<p>Each subtest marks itself as parallel, and calls <code>setup</code> to procure a test bundle:</p>

<pre><code class="language-go">t.Run(&quot;AllProperties&quot;, func(t *testing.T) {
    t.Parallel()

    bundle, ctx := setup(t)
    
    ...
</code></pre>

<p>Each instance of a test bundle is fully insulated from every other instance, ensuring that no side effects from a test can leak into any other. Every test case uses a test transaction so that it&rsquo;s got its own private snapshot into the database for purposes of raising fixtures or querying.</p>

<p>We tend to put test bundles in every test case, even where the bundle contains only a single field. This is a courtesy to a future developer who might need to augment the test and where a preexisting test bundle makes that faster to do. It also keeps convention strong in case we need to do another broad refactor down the line.</p>

<h2 id="complete-example" class="link"><a href="#complete-example">Complete example</a></h2>

<p>Here&rsquo;s a full code sample with all the steps together:</p>

<pre><code class="language-go">func TestPlaygroundTutorialServiceCreate(t *testing.T) {
   t.Parallel()

   type testBundle struct {
      account *dbsqlc.Account
      svc     *playgroundTutorialService
      team    *dbsqlc.Team
      tx      db.Txer
   }

   setup := func(t *testing.T) (*testBundle, context.Context) {
      t.Helper()

      var (
         ctx = ptesting.Context(t)
         tx  = ptesting.TestTx(ctx, t)
      )

      var (
         team    = dbfactory.Team(ctx, t, tx, &amp;dbfactory.TeamOpts{})
         account = dbfactory.Account(ctx, t, tx, &amp;dbfactory.AccountOpts{})
         _       = dbfactory.AccessGroupAccount_Admin(ctx, t, tx, team.ID, account.ID)
      )
      ctx = authntest.Account(account).Context(ctx)

      return &amp;testBundle{
         account: account,
         svc:     pservicetest.InitAndStart(ctx, t, NewPlaygroundTutorialService(), tx.Begin, nil),
         team:    team,
         tx:      tx,
      }, ctx
   }

   t.Run(&quot;AllProperties&quot;, func(t *testing.T) {
      t.Parallel()

      bundle, ctx := setup(t)

      resp, err := pservicetest.InvokeHandler(bundle.svc.Create, ctx, &amp;PlaygroundTutorialCreateRequest{
         BootstrapSQL: ptrutil.Ptr(`SELECT unnest(array[1,2,3]);`),
         Name:         &quot;My playground tutorial&quot;,
         Content:      &quot;# My tutorial\n\nThis is my SQL tutorial, created by **me**.&quot;,
         IsPinned:     true,
         IsPublic:     true,
         TeamID:       eid.EID(bundle.team.ID),
         Weight:       ptrutil.Ptr(int32(100)),
      })
      require.NoError(t, err)
      prequire.PartialEqual(t, &amp;apiresourcekind.PlaygroundTutorial{
         BootstrapSQL: ptrutil.Ptr(`SELECT unnest(array[1,2,3]);`),
         Content:      &quot;# My tutorial\n\nThis is my SQL tutorial, created by **me**.&quot;,
         IsPinned:     true,
         IsPublic:     true,
         Name:         &quot;My playground tutorial&quot;,
         TeamID:       eid.EID(bundle.team.ID),
         Weight:       ptrutil.Ptr(int32(100)),
      }, resp)

      _, err = dbsqlc.New().PlaygroundTutorialGetByID(ctx, bundle.tx, uuid.UUID(resp.ID))
      require.NoError(t, err)

      prequire.EventForActor(ctx, t, bundle.tx, &quot;playground_tutorial.created&quot;, bundle.account.ID)
   })
}
</code></pre>

<p>See also the <a href="/fragments/partial-equal"><code>PartialEqual</code> helper</a>, which I wasn&rsquo;t completely sure about when I first put it in, but am now fully bought into because it&rsquo;s shown itself to be so effective at keeping many consecutive assertions tidy.</p>


]]></content>
    <published>2024-10-27T16:06:21-07:00</published>
    <updated>2024-10-27T16:06:21-07:00</updated>
    <link href="https://brandur.org/fragments/parallel-test-bundle"></link>
    <id>tag:brandur.org,2024-10-27:fragments/parallel-test-bundle</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Rails World 2024</title>
    <summary>A few reflections on this year&amp;rsquo;s event. (It was great.)</summary>
    <content type="html"><![CDATA[<p>I attended Rails World again this year, this time in Toronto. A quick recap while it&rsquo;s still fresh.</p>

<p>What a great event. Both this year and last the organizers went out of their way to pick some of the most incredible venues I&rsquo;ve ever seen. Many places are adequate to the task of containing a conference for a few days, but few make your mouth go wide with a &ldquo;wow&rdquo; as you walk into the place.</p>

<p>This year&rsquo;s was held at Evergreen Brick Works, an old factory that lapsed into a state of disrepair for many years, and was later converted into an event venue. Its renovators decided to keep some aspects of the previous abandoned wreck. Its fallen-in roof wasn&rsquo;t replaced, leaving the evergreens that&rsquo;d grown in the interim stretching up into the sky (unclear what would&rsquo;ve happened if it&rsquo;d rained). Derelict machinery and the more tasteful graffiti had been left in place to add to the character. Meanwhile, ultra-modern acoustics and AV equipment made for excellent talks, and clashed nicely with the exposed brick.</p>

<p><img src="/photographs/fragments/rails-world-2024/evergreen.jpg" srcset="/photographs/fragments/rails-world-2024/evergreen@2x.jpg 2x, /photographs/fragments/rails-world-2024/evergreen.jpg 1x" loading="lazy" class="rounded-md"></p>

<p><img src="/photographs/fragments/rails-world-2024/graffiti.jpg" srcset="/photographs/fragments/rails-world-2024/graffiti@2x.jpg 2x, /photographs/fragments/rails-world-2024/graffiti.jpg 1x" loading="lazy" class="rounded-md"></p>

<p>Attention was paid to every detail. Quality drinks and delicious snacks were always on offer between sessions, and three food trucks operated all day outside (and good choices too: pizza served out of a decommissioned fire truck, beaver tails, and poutine, only Canada&rsquo;s best! <sup id="footnote-1-source"><a href="#footnote-1">1</a></sup>). One of my favorite details that was a holdover from the conference&rsquo;s first year is that all breakfast and lunch food is edible standing up, and served out of the same area that made up the convention floor. With few tables available, people mingle organically while eating, preventing a common conference lunch problem of groups self-siloing at tables where they stay immobile for 30+ minutes and meet few new people, if any. Organizers responded dynamically to fix problems as they arose. For example, lunch lines were too long on the first day, so by day two there were double the number of food stations. Pair programming sessions were available all day through Test Double.</p>

<p>This was all a nice change after attending RailsConf a few years back. There you couldn&rsquo;t even get coffee outside a tight 30 minute availability window in the morning. This was understandable because money was tight. Ruby Central was spending it on more important things, like paying out $500k cancellation penalties to send a political &ldquo;fuck you&rdquo; to the entire state of Texas, which happily took their money and proceeded to not notice at all. (It may not be a big surprise to hear that 2025 will be the last year of RailsConf.)</p>

<p>DHH is <a href="https://world.hey.com/dhh/wonderful-rails-world-vibes-7a6141d2">pretty transparent on numbers</a>, and was up front that Rails World operates at a loss that&rsquo;s backstopped by the large companies that form Rails Foundation:</p>

<blockquote>
<p>Rails Foundation, the founding core members listed above, as well as the contributing members [&hellip;], were willing to happily underwrite a loss of over $100,000 on the conference itself.</p>
</blockquote>

<p>I love it. This is one of the best ways for companies getting good leverage out of Ruby/Rails to give back to the community. We&rsquo;re not contributing anywhere near what a colossus like Shopify is, but it felt great to have Crunchy sponsoring the event.</p>

<p><img src="/photographs/fragments/rails-world-2024/rails-8.jpg" srcset="/photographs/fragments/rails-world-2024/rails-8@2x.jpg 2x, /photographs/fragments/rails-world-2024/rails-8.jpg 1x" loading="lazy" class="rounded-md"></p>

<h2 id="tech-highlights" class="link"><a href="#tech-highlights">Tech highlights</a></h2>

<p>I spent most of the conference at our booth, so I mostly only got a chance to catch <a href="https://www.youtube.com/watch?v=-cEn_83zRFw">the keynotes</a>, but that was enough to pick up the broad themes. A few notable highlights.</p>

<h3 id="solid-cache" class="link"><a href="#solid-cache">Solid Cache</a></h3>

<p>Like last year, David touched upon Solid Cache. This is such a great concept: caches traditionally needed to be memory-bound, using a component like memcached or Redis, because memory was fast and disks were slow. Now, memory is still fast, but with modern SSDs, disk is <em>also</em> fast, and available in much larger denominations. 37 Signals&rsquo; products like Hey put their cache in MySQL, where they run it on a 30 TB disk with 60 days retention, and which has a 96% cache hit rate. This especially improves cache hits for the long tail of older keys that would&rsquo;ve been long since evicted from a less spacious in-memory data set.</p>

<p>Solid Cache also dovetails well with the <a href="/fragments/single-dependency-stacks">single dependency stack</a>. Three years later we still run one and exactly one persistence component: Postgres. It&rsquo;s amazing just how plausible this is even for a mature stack, and it makes you realize that even the most fundamental belief systems of the programming world should be reevaluated every once in a while.</p>

<p>37 Signals stubbornly cargo cults Oracle products, but as Andrew covers, <a href="https://andyatkinson.com/solid-cache-rails-postgresql">Solid Cache can be made workable on Postgres too</a>. Although let me caveat that to say I&rsquo;ve never done it, and suspect that there might be issues with long-lived deletion expiration queries at the scale of 30 TB of data since Postgres isn&rsquo;t particularly good at efficiently deleting rows (a big reason that recent partitioning improvements are so important).</p>
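<p>To make that concrete, here&rsquo;s a rough sketch of my own (not Solid Cache&rsquo;s actual schema) of what a disk-backed cache in Postgres can look like, with eviction done as explicit batched deletes of the oldest rows, exactly the kind of long-running deletion work that needs care at this scale:</p>

<pre><code class="language-sql">-- Hypothetical sketch, not Solid Cache's real schema.
CREATE TABLE cache_entry (
    key bytea PRIMARY KEY,
    value bytea NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Eviction runs periodically as a batched delete of the oldest
-- entries rather than an in-memory LRU.
DELETE FROM cache_entry
WHERE key IN (
    SELECT key
    FROM cache_entry
    ORDER BY created_at
    LIMIT 1000
);
</code></pre>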

<h3 id="server-phobia" class="link"><a href="#server-phobia">Server-phobia</a></h3>

<p>For the last few months David&rsquo;s been on an anti-cloud mission. One of the keynote slides highlighted the size, capacity and cost of a Performance M dyno (1 core/2 threads w/ 2.5GB for $250/mo.), with the next showing a rough equivalent on Hetzner (48 cores/96 threads w/ 256GB for $220/mo.), the clear message being that the Hetzner box is 50-100x more capable, and also cheaper. A big new piece of Rails is <a href="https://kamal-deploy.org/">Kamal</a>, a system that&rsquo;s meant to make deployment to raw metal as simple as it is on Heroku. Kamal bundles the new <a href="https://github.com/basecamp/kamal-proxy">Kamal Proxy</a>, a reverse proxy that coordinates deploys, terminates TLS, and handles graceful restarts.</p>

<p><img src="/photographs/fragments/rails-world-2024/performance-m.jpg" srcset="/photographs/fragments/rails-world-2024/performance-m@2x.jpg 2x, /photographs/fragments/rails-world-2024/performance-m.jpg 1x" loading="lazy" class="rounded-md"></p>

<p><img src="/photographs/fragments/rails-world-2024/hetzner.jpg" srcset="/photographs/fragments/rails-world-2024/hetzner@2x.jpg 2x, /photographs/fragments/rails-world-2024/hetzner.jpg 1x" loading="lazy" class="rounded-md"></p>

<p>He&rsquo;s got a point with this one. For a long time servers represented a huge capital investment and distraction from building an actual product, and in that context AWS and its ancillaries are an attractive idea. But as anyone who&rsquo;s used a lot of AWS could tell you, it may be cheap in the beginning, but it&rsquo;s only a matter of time until that inverts, and AWS bills become a recurring nightmare.</p>

<p>That said, if I were trying to send this message I&rsquo;d be careful to make it clear that this is a trade off. You&rsquo;re unquestionably going to save money on hardware, but you&rsquo;ll spend more time on management. Someone&rsquo;s also going to be the one carrying the pager for all these boxes, and presumably that&rsquo;s not the 37 Signals CEO or any of its executive team.</p>

<h3 id="rails-8-1" class="link"><a href="#rails-8-1">Rails 8.1: Et tu search?</a></h3>

<p>Rails 8 was released that day, and he closed the keynote by touching on some expected features for its next major release, 8.1. Next in its sights is the beast that no sane person wants to run: Elasticsearch, with the promise of bringing a sophisticated search engine into Rails itself. Also up for inclusion is &ldquo;House (MD)&rdquo;, which would make Markdown a more native piece of the Rails stack.</p>

<pre><code class="language-ruby"># search on any field
Post.search &quot;announcement&quot;

# by specific fields
Post.search title: &quot;announcement&quot;, content: &quot;solid search&quot;
</code></pre>

<hr />

<p><img src="/photographs/fragments/rails-world-2024/conference-hall.jpg" srcset="/photographs/fragments/rails-world-2024/conference-hall@2x.jpg 2x, /photographs/fragments/rails-world-2024/conference-hall.jpg 1x" loading="lazy" class="rounded-md"></p>

<hr />

<h2 id="20-min" class="link"><a href="#20-min">Twenty min</a></h2>

<p>Rails World was bigger this year than last, but it&rsquo;s far from a huge conference, as shown by the competitive ticketing process, where tickets were gone 20 minutes after going on sale.</p>

<p>Hard-to-get tickets are bad, but a positive side effect is that everyone at Rails World <em>really wanted</em> to be at Rails World. You don&rsquo;t get there by accident. The result is that every single person you spoke to had something interesting to say. In one case I&rsquo;d randomly started talking to a couple Dutch guys staying at the same hotel I was, and 15 minutes later we were talking about the trade offs of Aurora versus vanilla Postgres. This will sound self-serving, but I met quite a few people that were already familiar with this website, and they&rsquo;d ask <em>me</em> about topics I&rsquo;d written about recently like <a href="https://www.crunchydata.com/blog/real-world-performance-gains-with-postgres-17-btree-bulk-scans">Postgres 17 bulk B-tree lookups</a> or <a href="/fragments/secure-bytes-without-pgcrypto">generating a couple secure bytes with <code>gen_random_uuid()</code></a>.</p>

<p>I love it. The passion and expertise is the closest I&rsquo;ve experienced at any event to what we used to get in the halcyon days of the early 2010s, before tech was so obviously the most important industry in the world, and became ludicrously financialized as every venture firm and Stanford graduate jumped to get a piece of it.</p>

<hr />

<h2 id="toronto" class="link"><a href="#toronto">Unpopular opinion: Toronto</a></h2>

<p>Now going on its second year, there&rsquo;s a tradition of announcing in the closing keynote where the next Rails World will be held. In 2025, it&rsquo;ll be back in Amsterdam, and I admit to breathing a sigh of relief (assuming I can even get in).</p>

<p>The Evergreen Brickworks venue is gorgeous, Shopify&rsquo;s Toronto office is fabulous, and I had a good time visiting the city. But. Toronto&rsquo;s downtown is enormous, and it&rsquo;s the kind of place where every street, at every hour day or night, is characterized by the constant roar of total, all-encompassing, gridlock traffic. And like anywhere, when traffic is bad and tempers are heated, roads are never enough space for the pinnacle of human innovation, the automobile, and cars spill over onto every crosswalk and bike lane. With the bike lanes full, bike traffic, 90%+ of it motorized, moves onto the sidewalks, with few riders even bothering to give lip service to those little foot rest doodads on the bottom of the bike that, before the advent of the lithium battery and lightweight motor, used to be for pedaling. Stop signs, red lights, and traffic priority all become the loosest of possible suggestions.</p>

<p>I&rsquo;d be exploring an inner city suburb, with leafy canopy and the most gorgeous, stately houses that positively <em>ooze</em> history in all directions. Amazing! Beautiful! Except, these otherwise quiet streets are filled to the brim with hundreds of bumper-to-bumper SUVs (no self-respecting Canadian drives anything smaller than an SUV, and a family of two or more should ideally upgrade to something a little more size appropriate, like an F-350) inching their way onward at a pace only marginally faster than a brisk walk. I&rsquo;d cross a bridge over a deep, forested ravine. Look over the edge, expecting to see a peaceful, bubbling brook far below. What do I see instead? A highway of course, which Torontonians have seen fit to plough through each of the city&rsquo;s precious few parks.</p>

<p>After one of the evening parties I found myself talking to a guy who was professing his undying love for the city of Toronto. Me: what exactly do you like about it? Him: the <em>diversityyyyyy</em> man. Me: &hellip; okay, &hellip; anything else?</p>

<p>Sorry, I can&rsquo;t help myself. But also, Amsterdam is the correct answer.</p>

<hr />

<p>To recap, great event, great people. I hope to see many of you there next year.</p>


]]></content>
    <published>2024-10-06T13:17:03-07:00</published>
    <updated>2024-10-06T13:17:03-07:00</updated>
    <link href="https://brandur.org/fragments/rails-world-2024"></link>
    <id>tag:brandur.org,2024-10-06:fragments/rails-world-2024</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>TIL: Variables in custom VSCode snippets</title>
    <summary>Using built-in variables in VSCode snippets to make publishing to this site incrementally faster.</summary>
    <content type="html"><![CDATA[<p>This blog is entirely driven by Markdown, TOML, and Git. Publishing an <a href="/atoms">atom</a> or <a href="/sequences">sequence</a> involves popping open a TOML file, adding a new item to the top, committing to Git, and pushing to origin to trigger a CI action that deploys the site:</p>

<pre><code class="language-toml">[[atoms]]
  published_at = 2024-10-04T10:24:22-07:00
  description = &quot;&quot;&quot;\
Hello, world!
&quot;&quot;&quot;
</code></pre>

<p>This generally works quite well, and in this developer&rsquo;s humble opinion, is far preferable to something involving a web UI with a little text box, but when I&rsquo;m being honest with myself, I have to admit that the friction to editing is a little too high, and prevents me from publishing posts that I would&rsquo;ve written if I was on a platform <em>with</em> a web UI and a little text box, like Twitter.</p>

<p>I&rsquo;d been using <a href="https://code.visualstudio.com/docs/editor/userdefinedsnippets">VSCode snippets</a> to speed up inserting a new TOML item, but the <code>published_at</code> date wasn&rsquo;t automated, so I&rsquo;d have to jump to a terminal, get a timestamp with <code>date</code>, then jump back and paste it. Not a big deal, but a little slow and mildly annoying.</p>
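<p>The terminal detour was some variation on this one-liner (my approximation of it; <code>date</code>&rsquo;s <code>%z</code> prints an offset like <code>-0700</code>, so a colon has to be spliced in for the strict RFC 3339 form):</p>

<pre><code class="language-sh">date +%Y-%m-%dT%H:%M:%S%z | sed -E 's/([0-9]{2})$/:\1/'
</code></pre>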

<p>I went back and RTFMed. It turns out that custom snippets support a number of built-in variables like <code>$TM_FILENAME</code>, <code>$CURRENT_SECONDS_UNIX</code>, or even <code>$UUID</code> for a random V4 UUID.</p>

<p>With a few more variables I got it to insert RFC3339 dates exactly like the ones I&rsquo;d been grabbing from my terminal:</p>

<pre><code class="language-json">{
	&quot;New atom&quot;: {
		&quot;prefix&quot;: &quot;at&quot;,
		&quot;body&quot;: [
			&quot;&quot;,
			&quot;[[atoms]]&quot;,
			&quot;  published_at = $CURRENT_YEAR-$CURRENT_MONTH-${CURRENT_DATE}T$CURRENT_HOUR:$CURRENT_MINUTE:$CURRENT_SECOND$CURRENT_TIMEZONE_OFFSET&quot;,
			&quot;  description = \&quot;\&quot;\&quot;\\&quot;,
			&quot;$1&quot;,
			&quot;\&quot;\&quot;\&quot;&quot;,
			&quot;&quot;
		],
		&quot;description&quot;: &quot;New atom&quot;
	}
}
</code></pre>

<p>There&rsquo;s quite a few other useful built-ins (e.g. currently selected text, contents of clipboard, start comment), and <a href="https://code.visualstudio.com/docs/editor/userdefinedsnippets#_transform-examples">transformations with regex</a> are supported.</p>

<p>I also took the time to get the whitespace around the inserted block exactly right, so no extra time is needed to correct it after insertion. All in all I probably saved myself about ten seconds for each snippet use, but it&rsquo;s enough of a gain to make myself marginally more likely to do it.</p>

<p>Next up (hopefully): a mobile publishing workflow, something that&rsquo;s been sorely missing for years.</p>
]]></content>
    <published>2024-10-04T11:18:21-07:00</published>
    <updated>2024-10-04T11:18:21-07:00</updated>
    <link href="https://brandur.org/fragments/vscode-snippets"></link>
    <id>tag:brandur.org,2024-10-04:fragments/vscode-snippets</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>A few secure, random bytes without `pgcrypto`</title>
    <summary>Avoiding the &lt;code&gt;pgcrypto&lt;/code&gt; extension and its OpenSSL dependency by generating cryptographically secure randomness through &lt;code&gt;gen_random_uuid()&lt;/code&gt;.</summary>
    <content type="html"><![CDATA[<p>In Postgres it&rsquo;s common to see the SQL <code>random()</code> function used to generate a random number, but it&rsquo;s a pseudo-random number generator, and not suitable for cases where real randomness is required critical. Postgres also provides a way of getting secure random numbers as well, but only through the use of the <code>pgcrypto</code> extension, which makes <code>gen_random_bytes</code> available.</p>

<p>Pulling <code>pgcrypto</code> into your database is probably fine&mdash;at least it&rsquo;s a core extension that&rsquo;s distributed with Postgres itself&mdash;but while testing the RC version of <a href="https://www.crunchydata.com/blog/real-world-performance-gains-with-postgres-17-btree-bulk-scans">Postgres 17</a> last week, I found that it was surprisingly difficult to build Postgres against OpenSSL, which is required to build <code>pgcrypto</code>, thereby making <code>pgcrypto</code> itself hard to build.</p>

<p>I&rsquo;m broadly against the use of Postgres extensions because they make upgrades harder and projects less portable <sup id="footnote-1-source"><a href="#footnote-1">1</a></sup>, so we have a minimal posture when it comes to them, depending only on <code>btree_gist</code> and <code>pgcrypto</code>. Like <code>pgcrypto</code>, <code>btree_gist</code> is also distributed with Postgres, but unlike <code>pgcrypto</code>, doesn&rsquo;t have an OpenSSL dependency, making it trivial to build.</p>

<p>Rather than wasting more time trying to get OpenSSL configured, I did a quick code audit to find out where we were using <code>pgcrypto</code>, and found that we were using it in exactly one place to generate random bytes for use in <a href="/nanoglyphs/026-ids">a ULID</a>:</p>

<pre><code class="language-sql">-- 10 entropy bytes
ulid = timestamp || gen_random_bytes(10);
</code></pre>

<p>Needing a whole extension for generating a few random bytes seems like a waste, but unfortunately Postgres doesn&rsquo;t offer a built-in way to get cryptographically secure random bytes in any other way &hellip; or does it?</p>

<h2 id="secure-bytes" class="link"><a href="#secure-bytes">Secure bytes, just not for you</a></h2>

<p>Internally, Postgres has a module called <code>pg_strong_random.c</code> that exports a <code>pg_strong_random()</code> function that will use OpenSSL if available, but can fall back to <code>/dev/urandom</code> in case it&rsquo;s not, which is perfectly fine for our purposes:</p>

<pre><code class="language-c">/*
 * pg_strong_random &amp; pg_strong_random_init
 *
 * Generate requested number of random bytes. The returned bytes are
 * cryptographically secure, suitable for use e.g. in authentication.
 *
 * Before pg_strong_random is called in any process, the generator must first
 * be initialized by calling pg_strong_random_init().
 *
 * We rely on system facilities for actually generating the numbers.
 * We support a number of sources:
 *
 * 1. OpenSSL's RAND_bytes()
 * 2. Windows' CryptGenRandom() function
 * 3. /dev/urandom
 *
 * Returns true on success, and false if none of the sources
 * were available. NB: It is important to check the return value!
 * Proceeding with key generation when no random data was available
 * would lead to predictable keys and security issues.
 */
</code></pre>

<p>So secure randomness is available without needing to dip into OpenSSL or <code>pgcrypto</code>. Postgres just doesn&rsquo;t make it available to you.</p>

<h2 id="roundabout-randomness" class="link"><a href="#roundabout-randomness">Roundabout randomness</a></h2>
 

<p>Luckily, there&rsquo;s a workaround. <code>pg_strong_random()</code> is called through another function that&rsquo;s exported to userspace, Postgres 13&rsquo;s <code>gen_random_uuid()</code> which generates a V4 UUID that&rsquo;s secure, random data with the exception of six variant/version bits in the middle:</p>

<pre><code class="language-c">Datum
gen_random_uuid(PG_FUNCTION_ARGS)
{
    pg_uuid_t  *uuid = palloc(UUID_LEN);

    if (!pg_strong_random(uuid, UUID_LEN))
        ereport(ERROR,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg(&quot;could not generate random values&quot;)));

    /*
     * Set magic numbers for a &quot;version 4&quot; (pseudorandom) UUID, see
     * http://tools.ietf.org/html/rfc4122#section-4.4
     */
    uuid-&gt;data[6] = (uuid-&gt;data[6] &amp; 0x0f) | 0x40;    /* time_hi_and_version */
    uuid-&gt;data[8] = (uuid-&gt;data[8] &amp; 0x3f) | 0x80;    /* clock_seq_hi_and_reserved */

    PG_RETURN_UUID_P(uuid);
}
</code></pre>

<p>Given our use of <code>pgcrypto</code> is so limited, and we only need ten random bytes at a time for a ULID, I changed our <code>gen_ulid()</code> implementation to find ten bytes of randomness by pulling five bytes off the front and back of a V4 UUID:</p>

<pre><code class="language-sql">-- 10 entropy bytes
--
-- We extract these by generating a random UUID and extracting
-- the first five bytes and last five bytes out of it (thus avoiding
-- versioning bits in the middle). This is a roundabout way of
-- doing this, but is done to avoid a dependency on the pgcrypto
-- extension just to get `gen_random_bytes()`.
--
-- `uuid_send()` changes `uuid` to `bytea`.
random_uuid = uuid_send(gen_random_uuid());
ulid = timestamp ||
    substring(random_uuid FROM 1 FOR 5) ||
    substring(random_uuid FROM 12 FOR 5);
</code></pre>

<p>Which then lets us rid ourselves of <code>pgcrypto</code>, along with OpenSSL:</p>

<pre><code class="language-sql">DROP EXTENSION pgcrypto;
</code></pre>

<p>Making tests against a locally built version of Postgres considerably easier.</p>
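<p>For illustration, the same byte splicing is easy to express outside the database too. Here&rsquo;s a sketch of mine in Go (not code from our implementation) that mirrors the SQL above:</p>

<pre><code class="language-go">package main

import (
    &quot;crypto/rand&quot;
    &quot;encoding/binary&quot;
    &quot;fmt&quot;
    &quot;time&quot;
)

// makeULID builds a 16-byte ULID-like value the same way the SQL
// does: a 6-byte big-endian millisecond timestamp followed by 10
// entropy bytes taken from positions 1-5 and 12-16 of 16 random
// bytes (a stand-in for the V4 UUID), skipping the middle bytes
// where the version/variant bits live.
func makeULID(t time.Time) ([16]byte, error) {
    var ulid [16]byte

    // 48-bit millisecond timestamp: the low 6 bytes of the uint64.
    var ts [8]byte
    binary.BigEndian.PutUint64(ts[:], uint64(t.UnixMilli()))
    copy(ulid[0:6], ts[2:8])

    // Stand-in for uuid_send(gen_random_uuid()).
    var u [16]byte
    if _, err := rand.Read(u[:]); err != nil {
        return ulid, err
    }

    copy(ulid[6:11], u[0:5])    // substring(random_uuid FROM 1 FOR 5)
    copy(ulid[11:16], u[11:16]) // substring(random_uuid FROM 12 FOR 5)

    return ulid, nil
}

func main() {
    ulid, err := makeULID(time.Now())
    if err != nil {
        panic(err)
    }
    fmt.Printf(&quot;%x\n&quot;, ulid)
}
</code></pre>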

<p>I&rsquo;m hoping we can ditch this hack as soon as V7 UUIDs land in core (they didn&rsquo;t make Postgres 17, which is very sad), but in the meantime, this trick might be useful to someone else.</p>


]]></content>
    <published>2024-09-24T11:38:37-07:00</published>
    <updated>2024-09-24T11:38:37-07:00</updated>
    <link href="https://brandur.org/fragments/secure-bytes-without-pgcrypto"></link>
    <id>tag:brandur.org,2024-09-24:fragments/secure-bytes-without-pgcrypto</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Direnv&#39;s `source_env`, and how to manage project configuration</title>
    <summary>How I accidentally stumbled across the &lt;code&gt;source_env&lt;/code&gt; directive and dramatically improved my configuration methodology overnight.</summary>
    <content type="html"><![CDATA[<p>For years I&rsquo;ve been using <a href="https://direnv.net/">Direnv</a> to manage configuration in projects. It&rsquo;s a small program that loads env vars out of an <code>.envrc</code> file on a directory by directory basis, using a shell hook to load vars as you enter a folder, and unload them as you leave.</p>

<p>A typical <code>.envrc</code>:</p>

<pre><code class="language-sh">export API_URL=&quot;http://localhost:5222&quot;
export DATABASE_URL=&quot;postgres://localhost:5432/project-db&quot;
export ENV_NAME=dev
</code></pre>

<p>The beauty of Direnv is not only that it&rsquo;s 12-factor friendly, but that it&rsquo;s language agnostic, and unlike its language-specific alternatives that hook into program code in various creative ways, Direnv makes configuration available to your main program <em>and</em> anything else you need to run with it.</p>

<p>So configuration is available for your project&rsquo;s core programs:</p>

<pre><code class="language-sh"># gets DATABASE_URL from env
make build/api &amp;&amp; build/api
</code></pre>

<p>And for all adjacent utilities, including ones that you didn&rsquo;t write, and would otherwise have no way of hooking into a bespoke configuration system:</p>

<pre><code class="language-sh"># still works fine!
goose -dir ./migrations/main postgres $DATABASE_URL
</code></pre>

<h2 id="uneven-distribution" class="link"><a href="#uneven-distribution">Uneven distribution</a></h2>

<p>For years I&rsquo;ve recommended in project READMEs to get started by copying an <code>.envrc</code> template and running the program:</p>

<pre><code class="language-sh">cp .envrc.sample .envrc
direnv allow
go test ./...
</code></pre>

<p><code>.envrc.sample</code> is committed to Git while <code>.envrc</code> is not due to the presumption that it may eventually be edited to include user-specific secrets.</p>

<p>That works fine, but has always had the downside that if configuration changes and <code>.envrc.sample</code> is updated, other developers don&rsquo;t get those changes unless they copy a fresh <code>.envrc.sample</code>, and they almost certainly won&rsquo;t think to do that. This is an advantage that I&rsquo;d thought language-specific configuration systems like <a href="https://www.npmjs.com/package/dotenv">Dotenv</a> had over Direnv, where they can often read multiple env files, some of which may contain shared configuration that&rsquo;s versioned with the repo.</p>

<h2 id="section-1" class="link"><a href="#section-1">The missing piece of the puzzle: `source_env`</a></h2>

<p>Well, after being a Direnv user for <em>ten years</em>, yesterday I learnt of the existence of <a href="https://direnv.net/man/direnv-stdlib.1.html"><code>source_env</code></a>, a special directive that can go in an <code>.envrc</code> and which will read vars out of another envrc file.</p>

<p>This simplifies the configuration of my projects <em>dramatically</em>. They have an <code>.envrc.sample</code>, but it&rsquo;s stripped down to almost nothing, containing only a <code>source_env</code> statement and room to add customization.</p>

<pre><code class="language-sh"># Common configuration for al developers, committed to Git.
source_env .envrc.local

# Custom env values go here.
</code></pre>

<p>Meanwhile, all default configuration migrates to a <code>.envrc.local</code> (the <code>.local</code> suffix not having any special meaning, but rather just a convention to use):</p>

<pre><code class="language-sh">#
# .envrc.local
#
# Shared env vars committed to Git and made available to all
# developers. As much configuration should go here as possible
# so that new env vars don't break anyone and everyone gets to
# benefit from improvements, but don't add anything too secret or
# too custom.
#

export API_URL=&quot;http://localhost:5222&quot;
export DATABASE_URL=&quot;postgres://localhost:5432/project-db&quot;
export ENV_NAME=dev
</code></pre>

<p><code>.envrc.local</code> is committed to Git, and when anyone changes configuration, all other developers get the updates the next time they pull from master.</p>

<p>This doesn&rsquo;t account for truly sensitive configuration that shouldn&rsquo;t be stored in a Git repository, but my advice on that: projects should always be able to gracefully degrade so they can run (at least in development mode) with no sensitive secrets at all. And <em>certainly</em> the test suite should be able to. If your project can&rsquo;t do that, something is wrong.</p>
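<p>Shell parameter expansion makes that kind of graceful degradation cheap to express in an <code>.envrc</code>. A sketch, with hypothetical variable names:</p>

<pre><code class="language-sh"># Fall back to harmless dev values when real secrets are absent.
# Variable names here are hypothetical.
: &quot;${MYPROJ_API_URL:=http://localhost:5222}&quot;
: &quot;${MYPROJ_STRIPE_KEY:=}&quot; # empty; payment code should no-op
export MYPROJ_API_URL MYPROJ_STRIPE_KEY
</code></pre>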

<p>For my money, Direnv + <code>source_env</code> is a perfect dev configuration system, and one that works cleanly in any language ecosystem.</p>
]]></content>
    <published>2024-09-20T04:53:58-07:00</published>
    <updated>2024-09-20T04:53:58-07:00</updated>
    <link href="https://brandur.org/fragments/direnv-source-env"></link>
    <id>tag:brandur.org,2024-09-20:fragments/direnv-source-env</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
  <entry>
    <title>Your Go version CI matrix might be wrong</title>
    <summary>As of Go 1.21, Go fetches toolchains automatically, and it&amp;rsquo;s easy to not be running the version that you thought you were running.</summary>
    <content type="html"><![CDATA[<p>We had an unpleasant surprise this week in <a href="https://github.com/riverqueue/river">River&rsquo;s</a> CI suite. Since the project&rsquo;s inception we <em>thought</em> we were supporting the latest two versions of Go (1.21 and 1.22), but it turns out that we never were.</p>

<p>As per common convention, we had a GitHub Actions CI matrix testing against both versions:</p>

<pre><code class="language-yaml">strategy:
  matrix:
    go-version:
      - &quot;1.21&quot;
      - &quot;1.22&quot;
</code></pre>

<p>That looks kosher, right? Wrong!</p>

<p>Builds were happily passing this whole time, but upon closer inspection of the install step, we see this:</p>

<pre><code class="language-txt">Run actions/setup-go@v5
Setup go version spec 1.21
Found in cache @ /opt/hostedtoolcache/go/1.21.12/x64
Added go to the path
Successfully set up Go version 1.21
go: downloading go1.22.5 (linux/amd64)
</code></pre>

<p>GitHub Actions had been downloading Go 1.21, then immediately upgrading itself to Go 1.22.</p>

<p>Since Go 1.21, Go has had a built-in concept called <a href="https://go.dev/doc/toolchain">toolchains</a>. An installed version of Go contains its own toolchain, but has the capacity to fetch and install other toolchains as well. Usually this is a convenient feature because it means you can drop into any Go project and immediately get it running with a single command with no package or version managers in sight, but it has unexpected side effects.</p>
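<p>This behavior is governed by the <code>GOTOOLCHAIN</code> setting. A rough summary of its main values (see Go&rsquo;s toolchains documentation for the full semantics):</p>

<pre><code class="language-txt">GOTOOLCHAIN=auto       # default: use the bundled toolchain, but fetch
                       # a newer one if go.mod demands it
GOTOOLCHAIN=local      # only ever use the bundled toolchain
GOTOOLCHAIN=go1.22.5   # always use this specific toolchain
</code></pre>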

<h2 id="go-mod" class="link"><a href="#go-mod"><code>go.mod</code> version and toolchain</a></h2>

<p>Along with toolchains, Go 1.21 also changed its treatment of <code>go</code> directives in <code>go.mod</code> so that instead of being advisory, they&rsquo;re now mandatory. Any Go project needs to have its own <code>go</code> directive set to something at least as high as any modules it requires. So if a dependency requires Go 1.22.5, the project itself must be set to at least Go 1.22.5. Most of the time you won&rsquo;t even notice this because getting a new module with <code>go get</code> will handle updating a project&rsquo;s <code>go</code> directive automatically.</p>
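<p>As a concrete illustration (module path hypothetical), requiring a dependency whose <code>go</code> directive is newer forces the consumer&rsquo;s up to match:</p>

<pre><code class="language-txt">// the dependency's go.mod
module example.com/dep

go 1.22.5
</code></pre>

<p>After <code>go get example.com/dep</code>, the consuming project&rsquo;s own <code>go.mod</code> ends up with <code>go 1.22.5</code> (or higher), whether or not that bump was intended.</p>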

<p>Given River is always a dependency, we want to provide as much leeway as possible on the minimum version bound, even while we&rsquo;ll always be using more modern versions of Go. <code>go.mod</code> files support a <code>go</code> directive to specify a minimum bound alongside a <code>toolchain</code> directive for a preferred toolchain:</p>

<pre><code class="language-txt">go 1.21

toolchain go1.22.5
</code></pre>

<p>Once again though, the presence of <code>toolchain</code> will cause CI jobs to upgrade themselves to 1.22 instead of running on the version of Go they&rsquo;re supposed to be targeting. We need one more magic env var to prevent this:</p>

<pre><code class="language-yaml">env:
  # The special value &quot;local&quot; tells Go to use the bundled Go
  # version rather than trying to fetch one according to a
  # `toolchain` value in `go.mod`. This ensures that we're
  # really running the Go version in the CI matrix rather than
  # one that the Go command has upgraded to automatically.
  GOTOOLCHAIN: local
</code></pre>

<h2 id="go-directives" class="link"><a href="#go-directives">Take care with <code>go</code> directives</a></h2>

<p>One lesson from this debacle is that Go modules that expect to be dependencies need to be very careful with the <code>go</code> directive in <code>go.mod</code> because it can have considerable downstream impact.</p>

<p>We&rsquo;re setting <code>go 1.21</code> which is the same as <code>go 1.21.0</code>, so any project that requires River will be able to use any patch version of Go 1.21 or 1.22.</p>

<p>Go&rsquo;s incredibly trigger-happy when it comes to changing a <code>go.mod</code>&rsquo;s <code>go</code> version, which it will happily and silently do at any opportunity. I&rsquo;m legitimately amazed that we haven&rsquo;t seen more problems where dependencies accidentally upgrade to a new version of Go and break any downstream projects where that new version isn&rsquo;t yet available. This could even happen where a patch version changes as a brand new Go release comes out, but isn&rsquo;t yet available in everyone&rsquo;s build systems.</p>

<p>River&rsquo;s a multi-module project, and we hadn&rsquo;t even intentionally updated to Go 1.22.5, which spurred the bug report that led to the discovery of the issue. I think what happened is that as we added new modules with <code>go mod init</code>, those would get assigned the latest patch release of Go, and then as we required those from other modules, the new versions would proliferate. We&rsquo;d see the change in diffs being reviewed, but didn&rsquo;t think much of it.</p>

<p>Along with patching all our directives to <code>go 1.21</code> we&rsquo;ll also be adding a CI check that verifies they all match up across modules to avoid any accidental version bumps in the future.</p>
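<p>A minimal sketch of what such a check might look like (assuming a POSIX shell and that every module&rsquo;s <code>go.mod</code> lives somewhere in the repository; not River&rsquo;s actual implementation):</p>

<pre><code class="language-sh">#!/bin/sh
# Fail if go.mod files across the repository disagree on their
# `go` directive.
versions=$(find . -name go.mod -exec awk '/^go / { print $2 }' {} \; | sort -u)

if [ &quot;$(printf '%s\n' &quot;$versions&quot; | wc -l)&quot; -gt 1 ]; then
    echo &quot;mismatched go directives:&quot;
    printf '%s\n' &quot;$versions&quot;
    exit 1
fi
</code></pre>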
]]></content>
    <published>2024-08-11T11:52:37-07:00</published>
    <updated>2024-08-11T11:52:37-07:00</updated>
    <link href="https://brandur.org/fragments/go-version-matrix"></link>
    <id>tag:brandur.org,2024-08-11:fragments/go-version-matrix</id>
    <author>
      <name>Brandur Leach</name>
      <uri>https://brandur.org</uri>
    </author>
  </entry>
</feed>