Auditing commit messages on GitHub

After reading Simon Willison’s amazing piece1 on how he adds new features to his open-source softwares, I wanted to adopt some of the good practices and incorporate them into my own workflow. One of the highlights of that post was how to kick off a feature work. The process roughly goes like this: Opening a new GitHub issue for the feature in the corresponding repository. Adding a rough description of the feature to the issue. ...

October 6, 2022

To quote or not to quote

My grug1 brain can never remember the correct semantics of quoting commands and variables in a UNIX shell environment. Every time I work with a shell script or run some commands in a Docker compose file, I’ve to look up how to quote things properly to stop my ivory tower from crashing down. So, I thought I’d list out some of the most common rules that I usually look up all the time. ...

October 5, 2022

Returning values from a shell function

TIL that returning a value from a function in bash doesn’t do what I thought it does. Whenever you call a function that’s returning some value, instead of giving you the value, Bash sets the return value of the callee as the status code of the calling command. Consider this example: #!/usr/bin/bash # script.sh return_42() { return 42 } # Call the function and set the return value to a variable. value=$return_42 # Print the return value. echo $value I was expecting this to print out 42 but instead it doesn’t print anything to the console. Turns out, a shell function doesn’t return the value when it encounters the return keyword. Rather, it stops the execution of the function and sets the status code of the last command in the function as the value that the function returns. ...

September 25, 2022

Verifying webhook origin via payload hash signing

While working with GitHub webhooks, I discovered a common pattern1 a webhook receiver can adopt to verify that the incoming webhooks are indeed arriving from GitHub; not from some miscreant trying to carry out a man-in-the-middle attack. After some amount of digging, I found that it’s quite a common practice that many other webhook services employ as well. Also, check out how Sentry does it here2. Moreover, GitHub’s documentation demonstrates the pattern in Ruby. So I thought it’d be a good idea to translate that into Python in a more platform-agnostic manner. The core idea of the pattern goes as follows: ...

September 18, 2022

Recipes from Python SQLite docs

While going through the documentation of Python’s sqlite31 module, I noticed that it’s quite API-driven, where different parts of the module are explained in a prescriptive manner. I, however, learn better from examples, recipes, and narratives. Although a few good recipes already exist in the docs, I thought I’d also enlist some of the examples I tried out while grokking them. Executing individual statements To execute individual statements, you’ll need to use the cursor_obj.execute(statement) primitive. ...

September 11, 2022

Prefer urlsplit over urlparse to destructure URLs

TIL from this1 video that Python’s urllib.parse.urlparse2 is quite slow at parsing URLs. I’ve always used urlparse to destructure URLs and didn’t know that there’s a faster alternative to this in the standard library. The official documentation also recommends the alternative function. The urlparse function splits a supplied URL into multiple seperate components and returns a ParseResult object. Consider this example: In [1]: from urllib.parse import urlparse In [2]: url = "https://httpbin.org/get?q=hello&r=22" In [3]: urlparse(url) Out[3]: ParseResult( scheme='https', netloc='httpbin.org', path='/get', params='', query='q=hello&r=22', fragment='' ) You can see how the function disassembles the URL and builds a ParseResult object with the URL components. Along with this, the urlparse function can also parse an obscure type of URL that you’ll most likely never need. If you notice closely in the previous example, you’ll see that there’s a params argument in the ParseResult object. This params argument gets parsed whether you need it or not and that adds some overhead. The params field will be populated if you have a URL like this: ...

September 10, 2022

ExitStack in Python

Over the years, I’ve used Python’s contextlib.ExitStack in a few interesting ways. The official documentation1 advertises it as a way to manage multiple context managers and has a couple of examples of how to leverage it. However, neither in the docs nor in GitHub code search2 I could find examples of some of the maybe unusual ways I’ve used it in the past. So, I thought I’d document them here. ...

August 27, 2022

Compose multiple levels of fixtures in pytest

While reading the second version of Brian Okken’s pytest book1, I came across this neat trick to compose multiple levels of fixtures. Suppose, you want to create a fixture that returns some canned data from a database. Now, let’s say that invoking the fixture multiple times is expensive, and to avoid that you want to run it only once per test session. However, you still want to clear all the database states after each test function runs. Otherwise, a test might inadvertently get coupled with another test that runs before it via the fixture’s shared state. Let’s demonstrate this: ...

July 21, 2022

Patch where the object is used

I was reading Ned Bachelder’s blog “Why your mock doesn’t work”1 and it triggered an epiphany in me about a testing pattern that I’ve been using for a while without being aware that there might be an aphorism on the practice. Patch where the object is used; not where it’s defined. To understand it, consider the example below. Here, you have a module containing a function that fetches data from some fictitious database. ...

July 18, 2022

Partially assert callable arguments with 'unittest.mock.ANY'

I just found out that you can use Python’s unittest.mock.ANY to make assertions about certain arguments in a mock call, without caring about the other arguments. This can be handy if you want to test how a callable is called but only want to make assertions about some arguments. Consider the following example: # test_src.py import random import time def fetch() -> list[float]: # Simulate fetching data from a database. time.sleep(2) return [random.random() for _ in range(4)] def add(w: float, x: float, y: float, z: float) -> float: return w + x + y + z def procss() -> float: return add(*fetch()) Let’s say we only want to test the process function. But process ultimately depends on the fetch function, which has multiple side effects—it returns pseudo-random values and waits for 2 seconds on a fictitious network call. Since we only care about process, we’ll mock the other two functions. Here’s how unittest.mock.ANY can make life easier: ...

July 17, 2022

When to use 'git pull --rebase'

Whenever your local branch diverges from the remote branch, you can’t directly pull from the remote branch and merge it into the local branch. This can happen when, for example: You checkout from the main branch to work on a feature in a branch named alice. When you’re done, you merge alice into main. After that, if you try to pull the main branch from remote again and the content of the main branch changes by this time, you’ll encounter a merge error. Reproduce the issue Create a new branch named alice from main. Run: ...

July 14, 2022

Apply constraints with 'assert' in Python

Whenever I need to apply some runtime constraints on a value while building an API, I usually compare the value to an expected range and raise a ValueError if it’s not within the range. For example, let’s define a function that throttles some fictitious operation. The throttle function limits the number of times an operation can be performed by specifying the throttle_after parameter. This parameter defines the number of iterations after which the operation will be halted. The current_iter parameter tracks the current number of times the operation has been performed. Here’s the implementation: ...

July 10, 2022

Automerge Dependabot PRs on GitHub

Whether I’m trying out a new tool or just prototyping with a familiar stack, I usually create a new project on GitHub and run all the experiments there. Some examples of these are: rubric: linter config initializer for Python exert: declaratively apply converter functions to class attributes hook-slinger: generic service to send, retry, and manage webhooks think-async: exploring cooperative concurrency primitives in Python epilog: container log aggregation with Elasticsearch, Kibana & Filebeat While many of these prototypes become full-fledged projects, most end up being just one-time journies. One common theme among all of these endeavors is that I always include instructions in the readme.md on how to get the project up and running—no matter how small it is. Also, I tend to configure a rudimentary CI pipeline that runs the linters and tests. GitHub Actions and Dependabot1 make it simple to configure a basic CI workflow. Dependabot keeps the dependencies fresh and makes pull requests automatically when there’s a new version of a dependency used in a project. ...

July 7, 2022

Stream process a CSV file in Python

A common bottleneck for processing large data files is—memory. Downloading the file and loading the entire content is surely the easiest way to go. However, it’s likely that you’ll quickly hit OOM errors. Often time, whenever I have to deal with large data files that need to be downloaded and processed, I prefer to stream the content line by line and use multiple processes to consume them concurrently. For example, say, you have a CSV file containing millions of rows with the following structure: ...

July 1, 2022

Bulk operations in Django with process pool

I’ve rarely been able to take advantage of Django’s bulk_create / bulk_update APIs in production applications; especially in the cases where I need to create or update multiple complex objects with a script. Often time, these complex objects trigger a chain of signals or need non-trivial setups before any operations can be performed on each of them. The issue is, bulk_create / bulk_update doesn’t trigger these signals or expose any hooks to run any setup code. The Django doc mentions these caveates1 in detail. Here are a few of them: ...

June 27, 2022

Read a CSV file from s3 without saving it to the disk

I frequently have to write ad-hoc scripts that download a CSV file from s31, do some processing on it, and then create or update objects in the production database using the parsed information from the file. In Python, it’s trivial to download any file from s3 via boto32, and then the file can be read with the csv module from the standard library. However, these scripts are usually run from a separate script server and I prefer not to clutter the server’s disk with random CSV files. Loading the s3 file directly into memory and reading its contents isn’t difficult but the process has some subtleties. I do this often enough to justify documenting the workflow here. ...

June 26, 2022

Distil git logs attached to a single file

I run git log --oneline to list out the commit logs all the time. It prints out a compact view of the git history. Running the command in this repo gives me this: d9fad76 Publish blog on safer operator.itemgetter, closes #130 0570997 Merge pull request #129 from rednafi/dependabot/... 6967f73 Bump actions/setup-python from 3 to 4 48c8634 Merge pull request #128 from rednafi/dependabot/pip/mypy-0.961 5b7a7b0 Bump mypy from 0.960 to 0.961 However, there are times when I need to list out the commit logs that only represent the changes made to a particular file. Here’s the command that does exactly that. ...

June 21, 2022

Safer 'operator.itemgetter' in Python

Python’s operator.itemgetter is quite versatile. It works on pretty much any iterables and map-like objects and allows you to fetch elements from them. The following snippet shows how you can use it to sort a list of tuples by the first element of the tuple: In [2]: from operator import itemgetter ...: ...: l = [(10, 9), (1, 3), (4, 8), (0, 55), (6, 7)] ...: l_sorted = sorted(l, key=itemgetter(0)) In [3]: l_sorted Out[3]: [(0, 55), (1, 3), (4, 8), (6, 7), (10, 9)] Here, the itemgetter callable is doing the work of selecting the first element of every tuple inside the list and then the sorted function is using those values to sort the elements. Also, this is faster than using a lambda function and passing that to the key parameter to do the sorting: ...

June 16, 2022

Guard clause and exhaustiveness checking

Nested conditionals suck. They’re hard to write and even harder to read. I’ve rarely regretted the time I’ve spent optimizing for the flattest conditional structure in my code. The following piece mimics the actions of a traffic signal: // src.ts enum Signal { YELLOW = "Yellow", RED = "Red", GREEN = "Green", } function processSignal(signal: Signal) :void { if (signal === Signal.YELLOW) { console.log("Slow down!"); } else { if (signal === Signal.RED) { console.log("Stop!"); } else { if (signal === Signal.GREEN) { console.log("Go!"); } } } } // Log processSignal(Signal.YELLOW) // prints 'Slow down!' processSignal(Signal.RED) // prints 'Stop!' The snippet above suffers from two major issues: ...

May 22, 2022

Health check a server with 'nohup $(cmd) &'

While working on a project with EdgeDB1 and FastAPI2, I wanted to perform health checks against the FastAPI server in the GitHub CI. This would notify me about the working state of the application. The idea is to: Run the server in the background. Run the commands against the server that’ll denote that the app is in a working state. Perform cleanup. Exit with code 0 if the check is successful, else exit with code 1. The following shell script demonstrates a similar workflow with a Python HTTP server. This script: ...

April 18, 2022