December 10

Taming flaky systems w/o DDoSing yourself in Python – Safe Retries with stamina

I’ve been dabbling with Hynek’s stamina for a while. It’s a Python tool for retrying flaky service calls, built on top of the battle-tested tenacity library. It comes with a more ergonomic API, saner defaults, and some cool hooks for testing.

One neat pattern I learned from reading the source code is using a context manager inside a for-loop to retry a block of code. If you’re writing a library that handles retries, you can’t wrap try...except around the body of a user-written for-loop yourself. Handing each attempt back as a context manager is a clever trick: the library regains control in __exit__ and can decide whether to suppress the exception and retry. It allows you to write code like this:

import httpx
import stamina

for attempt in stamina.retry_context(on=httpx.HTTPError):
    with attempt:
        resp = httpx.get("https://httpbin.org/status/404")
        resp.raise_for_status()
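
Under the hood, the generator yields context managers whose __exit__ swallows the retryable exception on every attempt except the last, so the for-loop can spin again. Here’s a rough, simplified sketch of how such an API could be built (my own illustration, not stamina’s actual implementation):

import time

class _Attempt:
    # Swallows retryable exceptions on all but the final attempt,
    # so the surrounding for-loop can try the block again.
    def __init__(self, on, is_last):
        self._on = on
        self._is_last = is_last
        self.succeeded = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.succeeded = True
            return False
        # Returning True suppresses the exception inside the with block.
        return issubclass(exc_type, self._on) and not self._is_last

def retry_context(on, attempts=3, base_delay=0.1):
    for n in range(attempts):
        attempt = _Attempt(on, is_last=(n == attempts - 1))
        yield attempt
        if attempt.succeeded:
            return
        time.sleep(base_delay * 2**n)  # exponential backoff between tries

On the last attempt, __exit__ returns False, so the final failure propagates to the caller instead of being swallowed.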

December 08

Transactions: myths, surprises and opportunities — Martin Kleppmann

This is hands down one of the best talks I’ve seen on the topic. Martin points out that in ACID, consistency doesn’t carry the same rigid meaning as the other three constituents. It was kinda shoved in there to make the mnemonic work.

He also highlighted that, while terms like read uncommitted, read committed, snapshot isolation, and serializable are widely used to describe different isolation levels, few can recall their exact meanings off the top of their heads. This is because the names reflect implementation details from 1970s databases rather than the actual concepts. Beyond clarifying isolation levels, the talk also explores how incredibly hard it is to achieve transactions across multiple services without centralized coordination.


November 13

November ramble — Oz Nova

I generally prefer not to comment on software development practices, because of something I’ve observed often enough that it feels like a law: for every excellent engineer who swears by a particular practice, there’s an even better one who swears by the opposite. Some people couldn’t imagine coding without unit tests, or code review, or continuous integration, or step-through debugging, or [your preferred “best practice”]. Yet, there are people out there who do the exact opposite and outperform us all.


November 11

Brian Kernighan reflects on Unix: A History and a Memoir — Book Overflow

So is it possible for, you know, two or three people like us to have a really good idea and do something that does transform our world? I suspect that that still can happen. It’s different, and certainly at the time I was in, you know, early days of Unix, the world was smaller and simpler in the computing world, and so it was probably easier, and there was more low-hanging fruit. But I suspect that there’s still opportunities like that.

I think the reason Unix and all of the things that went with it worked so well was there was a big contribution from Doug’s ability to improve people’s lives so that what we did was described well as well as working well. So I think Doug in that sense would be the unsung person who didn’t get as much recognition as perhaps deserved.

— Brian Kernighan


November 10

Python’s finally gotchas

Python core developer Irit Katriel recently shared a short piece discussing a few gotchas with Python’s finally clause. I don’t think I’ve ever seen continue, break, or return statements in a finally block, and it’s best to avoid them there, as they can lead to some unusual behavior.

The return statement in the finally block can suppress exceptions implicitly. For example:

def foo() -> int:
    try:
        1 / 0
    except Exception:
        raise
    finally:
        return 0  # swallows the ZeroDivisionError that was just re-raised

Running this function will suppress the exception and return 0. While this might seem surprising, it works this way because Python guarantees that the finally block always runs, and a return there replaces the in-flight exception. The issue can be avoided by removing the finally block and dedenting the return. Similarly, continue and break behave surprisingly in that block: they, too, discard the active exception.
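
For instance, here’s a contrived example of break doing the same kind of swallowing:

def swallow() -> str:
    while True:
        try:
            raise RuntimeError("lost")
        finally:
            break  # exits the loop and silently discards the RuntimeError
    return "no error seen"

print(swallow())  # prints "no error seen"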

This behavior is documented in the official docs. However, maintainers are considering making this a warning and, eventually, illegal.


November 9

Software engineering at Google

I’ve been skimming through Software Engineering at Google over the past few days. It’s available online for free, which is a nice bonus. Rather than focusing on specific technologies or operational mechanisms, the book highlights the organization-wide engineering policies that have helped Google scale. The text is sparse and, at times, quite boring, but there are definitely some gems that kept me going. Here are three interesting terms I’ve picked up so far:

Beyoncé Rule – Inspired by Beyoncé’s line, “If you liked it, then you should have put a ring on it.” If you think something’s important, write a test for it and make sure it’s part of the CI.

Chesterton’s Fence – Don’t dismantle an established practice without understanding why it exists. Consider why certain legacy systems or rules are in place before changing or removing them.

Haunted Graveyard – Parts of the codebase no one wants to touch—difficult to maintain or just feel “cursed.” They’re usually left alone because the cost to update them is high, and no one fully understands them.

I’ve always wanted to put names on these things, and now I can!


November 08

Books on engineering policies vs mechanisms

The further I’ve gotten in my career, the less value I’ve gained from books on mechanisms and the more from books on policies. But policy books are boring.

My 17th book on writing better Python or Go was way more fun to read than Software Engineering at Google but yielded far less value—the age-old strategy vs. operations dichotomy.


October 27

Understanding round robin DNS

Round Robin DNS works by adding multiple IP addresses for the same domain in your DNS provider’s settings. For example, if you’re using a VPS from DigitalOcean or Hetzner, you’d add a bunch of A records for the same subdomain (like foo.yourdomain.com) and point each to a different server IP, like:

  • 203.0.113.45
  • 198.51.100.176
  • 5.62.153.87
  • 89.160.23.104

When a request comes in, the DNS resolver returns the list of IPs, typically rotating their order, and the client connects to one of them: basically a poor man’s load balancer. But there are some client-side quirks in how browsers pick the IPs, and this blog digs into that.
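
You can observe this from Python; the host below is a placeholder, but any domain with multiple A records will return several addresses:

import socket

# Collect the distinct A records the resolver hands back for the host.
infos = socket.getaddrinfo("foo.yourdomain.com", 443, type=socket.SOCK_STREAM)
print({info[4][0] for info in infos})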

Writes and write-nots — Paul Graham

These two powerful opposing forces, the pervasive expectation of writing and the irreducible difficulty of doing it, create enormous pressure. This is why eminent professors often turn out to have resorted to plagiarism. The most striking thing to me about these cases is the pettiness of the thefts. The stuff they steal is usually the most mundane boilerplate — the sort of thing that anyone who was even halfway decent at writing could turn out with no effort at all. Which means they’re not even halfway decent at writing.


October 14

OpenTelemetry client architecture

At the highest architectural level, OpenTelemetry clients are organized into signals. Each signal provides a specialized form of observability. For example, tracing, metrics, and baggage are three separate signals. Signals share a common subsystem – context propagation – but they function independently from each other.

Each signal provides a mechanism for software to describe itself. A codebase, such as a web framework or a database client, takes a dependency on various signals in order to describe itself. OpenTelemetry instrumentation code can then be mixed into the other code within that codebase. This makes OpenTelemetry a cross-cutting concern: a piece of software which is mixed into many other pieces of software in order to provide value. Cross-cutting concerns, by their very nature, violate a core design principle – separation of concerns. As a result, OpenTelemetry client design requires extra care and attention to avoid creating issues for the codebases which depend upon these cross-cutting APIs.

OpenTelemetry clients are designed to separate the portion of each signal which must be imported as cross-cutting concerns from the portions which can be managed independently. OpenTelemetry clients are also designed to be an extensible framework. To accomplish these goals, each signal consists of four types of packages: API, SDK, Semantic Conventions, and Contrib.
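
In the Python client, the API/SDK split looks roughly like this; a sketch assuming only the opentelemetry-api package, with an arbitrary instrumentation name:

from opentelemetry import trace  # the cross-cutting API, safe for libraries

tracer = trace.get_tracer("my.library")  # hypothetical instrumentation name

def handle_request():
    # A no-op span until an application installs and configures an SDK.
    with tracer.start_as_current_span("handle_request"):
        ...

Because the API ships as a no-op by default, libraries can instrument themselves without forcing an SDK (or its overhead) on their users.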


October 05

Private DNS with MagicDNS — Tailscale blog

Tailscale runs a built-in DNS server on every node, listening at 100.100.100.100.

Yes, Tailscale on your phone includes a DNS server. (We admit that “even on your phone!” is a little silly when phones are basically supercomputers these days.)

The IP 100.100.100.100, usually pronounced “quad one hundred,” is part of the private Carrier-Grade NAT range. That means, just like IPs in the common private ranges, 192.168.1/24, 172.16/12, and 10/8, it is not routable on the public internet. So when software on your computer sends a traditional, unencrypted UDP packet to 100.100.100.100, no standard router will send it anywhere.

We then tell your OS that its DNS server is 100.100.100.100. Because operating system DNS clients are largely stuck in 1987, they then forward all their DNS queries over old-school insecure UDP DNS to 100.100.100.100. Tailscale also installs a route for 100.100.100.100/32 back into Tailscale, which then hands those packets to its built-in DNS server, so unencrypted queries never leave your device.
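
As an aside, the Carrier-Grade NAT range is 100.64.0.0/10 (RFC 6598), which is easy to verify with the standard library:

import ipaddress

# 100.100.100.100 falls inside the CGNAT block 100.64.0.0/10.
print(ipaddress.ip_address("100.100.100.100") in ipaddress.ip_network("100.64.0.0/10"))  # True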


October 04

Git reset vs revert

I misunderstood git revert and made a mess out of my main branch today. Thought it worked like git reset—but they’re not quite the same.

Here’s the breakdown:

  • git reset --soft <commit-sha> moves the branch pointer back to the specified commit but keeps your changes staged. It rewrites history, so you’ll need a force push to update the remote.

  • git revert <commit-sha> creates a new commit that undoes the changes from that commit without meddling with history. No force push needed.

Seems like revert is what you need if you accidentally merge something into main. Keeps things clean without rewriting history.


September 28

Rails World 2024 opening keynote — David Heinemeier Hansson

I was really hyped about this year’s Rails World, even though I don’t code much in Ruby or Rails. I’ve been following 37signals’ work on simplifying deployment complexity and dogfooding their own tools to show how well they work.

It’s also refreshing to see someone with more influence acknowledging that the JS ecosystem is unsustainably complex. Not everyone digs that, no matter how hip it might be. Personally, I usually have a higher tolerance for backend and infra complexity than for frontend.

Kamal 2.0 now makes it easy to deploy multiple containers behind SSL on a single VM without dealing with the usual infrastructure idiosyncrasies.

Then we have Kamal 2. This is how you’re going to get your application into the cloud, your own hardware, or any container, because we’re not tying ourselves to a PaaS. Kamal 2 levels this up substantially. It does Auto SSL through Let’s Encrypt, so you don’t even have to know anything about provisioning an SSL certificate. It allows multiple applications to run on a single server, scaling down as well as up. It comes with a simple declaration setup for detailing what your deployment looks like, encapsulated in the fewest possible pieces of information to get as close as possible to no config.

The initial trigger for me to get interested in no build for Rails 7 was an infuriating annoyance: being unable to compile a JavaScript project I had carelessly left alone for about five minutes. None of the tools worked; everything was outdated. And when I tried to update it so I could compile it again, I literally couldn’t figure it out. I spent half a day wrestling with Webpacker at the time, and I did turn over the table, saying, ‘No, I made the integration for Webpacker to Rails, and I cannot figure out how this works. There’s something deeply, fundamentally broken in that model.’ And that’s when I realized the truth: only the browser is forever.


September 25

The man who killed Google search — Edward Zitron

Every single article I’ve read about Gomes’ tenure at Google spoke of a man deeply ingrained in the foundation of one of the most important technologies ever made, who had dedicated decades to maintaining a product with a — to quote Gomes himself — “guiding light of serving the user and using technology to do that.” And when finally given the keys to the kingdom — the ability to elevate Google Search even further — he was ratfucked by a series of rotten careerists trying to please Wall Street, led by Prabhakar Raghavan.


September 23

Microservices are technical debt — Matt Ranney, Principal Engineer, Doordash

Microservices are technical debt because while they initially allow teams to move faster by working independently, they eventually create a distributed monolith, where services become so intertwined that they require excessive maintenance and coordination, slowing down future development.

The real driver for adopting microservices is not necessarily scaling traffic, but scaling teams—when too many developers are working on the same monolith, they step on each other’s toes during deployments, forcing the need for smaller, independently deployable services.

Surely at this point the comment threads are going to explode with people saying that microservices should never share databases—like, can you believe that sacrilege of having two services share the same database? How do you live with yourself?


September 22

How streaming LLM APIs work — Simon Willison

While it’s pretty easy to build a simple HTTP streaming endpoint with any basic framework and some generator-like language construct, I’ve always been curious about how production-grade streaming LLM endpoints from OpenAI, Anthropic, or Google work. It seems like they’re using a similar pattern:

All three of the APIs I investigated worked roughly the same: they return data with a content-type: text/event-stream header, which matches the server-sent events mechanism, then stream blocks separated by \r\n\r\n. Each block has a data: JSON line. Anthropic also include an event: line with an event type.

Annoyingly these can’t be directly consumed using the browser EventSource API because that only works for GET requests, and these APIs all use POST.

It seems like all of them use a somewhat compliant version of Server-Sent Events (SSE) to stream the responses.
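
Here’s a rough sketch of consuming such a stream with httpx; the endpoint and payload are placeholders, and the [DONE] terminator mimics the OpenAI-style marker:

import json

import httpx

with httpx.stream("POST", "https://api.example.com/v1/stream", json={"prompt": "hi"}) as resp:
    for line in resp.iter_lines():
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":  # OpenAI-style end-of-stream marker
                break
            print(json.loads(payload))  # one decoded event per SSE block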


September 17

DHH talks Apple, Linux, and running servers — How About Tomorrow

During yesterday evening’s walk, I had a lot of fun listening to DHH rant about the Apple ecosystem and the big cloud providers. I can totally get behind how so many people find deployment harder than it actually is, and how the big cloud providers are making bank off that.

We were incredibly proud that we were going to take on Gmail with a fresh new system based on thinking from 2020, not 2004, and we thought that was going to be the big boss, right? We’re going to take on Google with an actually quite good email system. But we didn’t even get to begin that fight because before a bigger boss showed up and just like Apple sat down on our chest and said, ‘Give me your—you’re going to give me your lunch money and 30% of everything you own in perpetuity going forward.’

We used to be in the cloud. We used to be on AWS. We used to be on all this stuff for a bunch of our things with Basecamp and Hey, and we yanked all of it out because cost was just getting ridiculous, and we built a bit of tooling, and now I’m on a goddamn mission to make open source as capable, as easy to use as all these AWS resellers against any box running basic Linux with an IP address you can connect to.


September 16

The many meanings of event-driven architecture — Martin Fowler

In this 2017 talk, Martin Fowler untangled a few concepts for me that often get lumped together under the event-driven umbrella. He breaks event-driven systems into four main types:

Event notification: A system sends out a signal (event) when something happens but with minimal details. Other systems receive the notification and must request more information if needed. This keeps things simple and decoupled but makes tracking harder since the event doesn’t include full data.

Event-carried state transfer: Events carry all the necessary data upfront, so no extra requests are needed. This simplifies interactions but can make events bulky and harder to manage as the system scales (see the payload sketch after this list).

Event sourcing: Instead of storing just the current state, the system logs every event that occurs. This allows you to reconstruct the state at any time. It’s great for auditing and troubleshooting but adds complexity as log data grows.

CQRS: Commands (write operations) and queries (read operations) are handled separately, letting each be optimized on its own. It works well for complex domains but introduces more architectural overhead and needs careful planning.
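
To make the first two concrete, here are two hypothetical payloads for the same order.placed event (every field name here is invented for illustration):

# Event notification: minimal facts; consumers call back for details.
order_placed_notification = {
    "type": "order.placed",
    "order_id": "ord_123",
    "detail_url": "/orders/ord_123",  # consumer fetches the rest
}

# Event-carried state transfer: the event carries the full state.
order_placed_state = {
    "type": "order.placed",
    "order_id": "ord_123",
    "customer": {"id": "cus_9", "email": "a@example.com"},
    "items": [{"sku": "sku_1", "qty": 2, "unit_price": 1999}],
    "total": 3998,
}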

Interestingly, I’ve been using the second one without knowing what it was called.


September 15

Founder Mode, hackers, and being bored by tech — Ian Betteridge

On a micro scale, I think, there’s still a lot to be excited about. But on the macro level, this VC-Founder monoculture has been stealing the thunder from what really matters—the great technology that should have been a testament to the hive mind’s ingenuity. Instead, all the attention is on the process itself.

Tech has become all Jobs and no Woz. As Dave Karpf rightly identifies, the hacker has vanished from the scene, to be replaced by an endless array of know-nothing hero founders whose main superpower is the ability to bully subordinates (and half of Twitter) into believing they are always right.


September 14

Simon Willison on the Software Misadventures podcast

I spent a delightful 2 hours this morning listening to Simon Willison talk about his creative process and how LLMs have evolved his approach.

He shared insights into how he’s become more efficient with time, writing consistently on his blog, inspired by things like Duolingo’s streak and Tom Scott’s weekly video run for a decade. Another thing I found fascinating is how he uses GitHub Issues to record every little detail of a project he’s working on. This helps him manage so many projects at once without burning out. Simon even pulled together a summary from the podcast transcript that captured some of the best bits of the discussion.

About 5 years ago, one of Simon’s tweets inspired me to start publishing my thoughts and learnings, no matter how trivial they may seem. My career has benefited immensely from that. The process of reifying your ideas and learning on paper seems daunting at first, but it gets easier over time.


September 09

Canonical log lines — Stripe Engineering Blog

I’ve been practicing this for a while but didn’t know what to call it. Canonical log lines are arbitrarily wide structured log messages that get fired off at the end of a unit of work. In a web app, you could emit a special log line tagged with different IDs and attributes at the end of every request. The benefit is that when debugging, these are the logs you’ll check first. Sifting through fewer messages and correlating them with other logs makes investigations much more effective, and the structured nature of these logs allows for easier filtering and automated analysis.

Out of all the tools and techniques we deploy to help get insight into production, canonical log lines in particular have proven to be so useful for added operational visibility and incident response that we’ve put them in almost every service we run—not only are they used in our main API, but there’s one emitted every time a webhook is sent, a credit card is tokenized by our PCI vault, or a page is loaded in the Stripe Dashboard.
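
A minimal sketch of the pattern in a generic Python request handler; request and dispatch are hypothetical stand-ins for whatever your framework provides:

import json
import logging
import time

logger = logging.getLogger("canonical")

def handle(request, dispatch):
    start = time.monotonic()
    record = {"http_method": request.method, "http_path": request.path}
    try:
        response = dispatch(request)
        record["http_status"] = response.status_code
        return response
    finally:
        record["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        # One wide, structured line per request, emitted at the very end.
        logger.info("canonical-log-line %s", json.dumps(record))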


September 07

Recognizing the Gell-Mann Amnesia effect in my use of LLM tools

It took time for me to recognize the Gell-Mann Amnesia effect shaping how I use LLM tools in my work. When dealing with unfamiliar tech, I’m quick to accept suggestions verbatim, but in a domain I know, the patches rarely impress and often get torn to shreds.


September 04

On the importance of ablation studies in deep learning research — François Chollet

This is true for almost any engineering effort. It’s always a good idea to ask if the design can be simplified without losing usability. Now I know there’s a name for this practice: ablation study.

The goal of research shouldn’t be merely to publish, but to generate reliable knowledge. Crucially, understanding causality in your system is the most straightforward way to generate reliable knowledge. And there’s a very low-effort way to look into causality: ablation studies. Ablation studies consist of systematically trying to remove parts of a system—making it simpler—to identify where its performance actually comes from. If you find that X + Y + Z gives you good results, also try X, Y, Z, X + Y, X + Z, and Y + Z, and see what happens.

If you become a deep learning researcher, cut through the noise in the research process: do ablation studies for your models. Always ask, “Could there be a simpler explanation? Is this added complexity really necessary? Why?”
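
The subset enumeration he describes is mechanical enough to sketch; evaluate() below is a stand-in for training and scoring the simplified system:

from itertools import combinations

def evaluate(parts):
    return len(parts)  # dummy score; swap in a real train-and-measure step

features = ["X", "Y", "Z"]
for r in range(1, len(features) + 1):
    for subset in combinations(features, r):
        print(subset, evaluate(subset))  # X; Y; Z; X+Y; X+Z; Y+Z; X+Y+Z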


September 01

Why A.I. Isn’t Going to Make Art — Ted Chiang, The New Yorker

I indiscriminately devour almost everything Ted Chiang puts out, and this piece is no exception. It’s one of the most articulate arguments I’ve read on the sentimental value of human-generated artifacts, even when AI can make perfect knockoffs.

I’m pro-LLMs and use them to aid my work all the time. While they’re incredibly useful for a certain genre of tasks, buying into the Silicon Valley idea that these are soon going to replace every type of human-generated content is incredibly naive and redolent of the hubris within the tech bubble.

Art is notoriously hard to define, and so are the differences between good art and bad art. But let me offer a generalization: art is something that results from making a lot of choices. This might be easiest to explain if we use fiction writing as an example. When you are writing fiction, you are—consciously or unconsciously—making a choice about almost every word you type; to oversimplify, we can imagine that a ten-thousand-word short story requires something on the order of ten thousand choices. When you give a generative-A.I. program a prompt, you are making very few choices; if you supply a hundred-word prompt, you have made on the order of a hundred choices.

Generative A.I. appeals to people who think they can express themselves in a medium without actually working in that medium. But the creators of traditional novels, paintings, and films are drawn to those art forms because they see the unique expressive potential that each medium affords. It is their eagerness to take full advantage of those potentialities that makes their work satisfying, whether as entertainment or as art.

Any writing that deserves your attention as a reader is the result of effort expended by the person who wrote it. Effort during the writing process doesn’t guarantee the end product is worth reading, but worthwhile work cannot be made without it.

Some individuals have defended large language models by saying that most of what human beings say or write isn’t particularly original. That is true, but it’s also irrelevant. When someone says “I’m sorry” to you, it doesn’t matter that other people have said sorry in the past; it doesn’t matter that “I’m sorry” is a string of text that is statistically unremarkable. If someone is being sincere, their apology is valuable and meaningful, even though apologies have previously been uttered. Likewise, when you tell someone that you’re happy to see them, you are saying something meaningful, even if it lacks novelty.


August 31

How to Be a Better Reader — Tina Jordan, The NY Times

To read more deeply, to do the kind of reading that stimulates your imagination, the single most important thing to do is take your time. You can’t read deeply if you’re skimming. As the writer Zadie Smith has said, “When you practice reading, and you work at a text, it can only give you what you put into it.”

At a time when most of us read in superficial, bite-size chunks that prize quickness — texts, tweets, emails — it can be difficult to retrain your brain to read at an unhurried pace, but it is essential. In “Slow Reading in a Hurried Age,” David Mikics writes that “slow reading changes your mind the way exercise changes your body: A whole new world will open up, you will feel and act differently, because books will be more open and alive to you.”


August 26

Dark Matter — Blake Crouch

I just finished the book. It’s an emotional rollercoaster of a story, stemming from a MacGuffin that enables quantum superposition in the macro world, bringing the Copenhagen interpretation of quantum mechanics to life.

While the book starts off with a bang, it becomes a bit more predictable as the story progresses. I still enjoyed how well the author reified the dilemmas that access to the multiverse might pose. Highly recommended. I’m already beyond excited to read his next book, Recursion.