Django has a Model.objects.bulk_update method that allows you to update multiple objects
in a single pass. While this method is a great way to speed up the update process,
oftentimes it’s not fast enough. Recently, at my workplace, I found myself writing a script
to update half a million user records and it was taking quite a bit of time to mutate them
even after leveraging bulk update. So I wanted to see if I could use multiprocessing with
.bulk_update to quicken the process even more. Turns out, yep I can!
Installing Python on macOS with asdf
I’ve just migrated from Ubuntu to macOS for work and am still in the process of setting up the machine. I’ve been a lifelong Linux user and this is the first time I’ve picked up an OS that’s not just another flavor of Debian. Primarily, I work with Python, Node.js, and a tiny bit of Go. Previously, any time I had to install these language runtimes, I’d execute a bespoke script that’d install:
Save models with update_fields for better performance in Django
TIL that you can specify update_fields while saving a Django model to generate a leaner
underlying SQL query. This yields better performance while updating multiple objects in a
tight loop. To test that, I’m opening an IPython shell with the
python manage.py shell -i ipython command and creating a few user objects with the
following lines:
In [1]: from django.contrib.auth.models import User
In [2]: for i in range(1000):
   ...:     fname, lname = f'foo_{i}', f'bar_{i}'
   ...:     User.objects.create(
   ...:         first_name=fname, last_name=lname, username=fname)
   ...:
Here’s the underlying query Django generates when you’re trying to save a single object:
Python logging quirks in AWS Lambda environment
At my workplace, while working on a Lambda function, I noticed that my Python logs weren’t appearing on the corresponding CloudWatch log dashboard. At first, I thought that the function wasn’t picking up the correct log level from the environment variables. We were using the Serverless Framework and GitLab CI to deploy the function, so my first line of investigation involved checking for missing environment variables in those config files.
However, I quickly realized that the environment variables were being propagated to the Lambda function as expected. So, the issue had to be coming from somewhere else. After perusing some docs and the source code of the Lambda Python Runtime Interface Client, I discovered that the AWS Lambda Python runtime pre-configures a logging handler that modifies the format of the log message and adds some metadata to the record if available. What’s not pre-configured, though, is the log level. This means that any message logged below the default level silently won’t print anything.
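A minimal sketch of the fix this discovery implies: explicitly set the level on the root logger that Lambda has already wired up with a handler. The LOG_LEVEL variable name here is my own assumption, not an AWS convention:

```python
import logging
import os

# Lambda's runtime already attached a handler to the root logger, so we
# only need to set the level; read it from an (assumed) env variable.
logger = logging.getLogger()
logger.setLevel(os.environ.get("LOG_LEVEL", "INFO"))

logger.info("this message now makes it to CloudWatch")
```

With the level lowered to INFO, `logger.info(...)` calls show up in the log stream instead of being dropped.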
Dissecting an outage caused by eager-loading file content
Python makes it freakishly easy to load the whole content of any file into memory and process it afterward. This is one of the first things that’s taught to people who are new to the language. While the following snippet might be frowned upon by many, it’s definitely not uncommon:
# src.py
with open("foo.csv", "r") as f:
    # Load the whole content of the file as a string in memory.
    f_content = f.read()

# ...do your processing here.
...
Adopting this pattern as the default way of handling files isn’t the most terrible thing in the world for sure. Also, this is often the preferred way of dealing with image files or blobs. However, overzealously loading file content is only okay as long as the file is smaller than the available memory of the host system.
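For contrast, here’s a minimal sketch of the lazy alternative: iterating over the file object streams it line by line, so memory stays bounded by the longest line rather than the file size. The sample file is created inline to keep the snippet self-contained:

```python
# Create a small sample file so the sketch is self-contained.
with open("foo.csv", "w") as f:
    f.write("a,b\n1,2\n3,4\n")

# Stream the file line by line instead of slurping it whole; Python's
# file objects are iterators, so this never holds the full content.
row_count = 0
with open("foo.csv", "r") as f:
    for line in f:
        row_count += 1

print(row_count)  # → 3
```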
Verifying webhook origin via payload hash signing
While working with GitHub webhooks, I discovered a common security pattern that a receiver can adopt to verify that incoming webhooks are indeed arriving from GitHub, not from some miscreant trying to carry out a man-in-the-middle attack. After some digging, I found that it’s quite a common practice that many other webhook services employ as well. Also, check out how Sentry handles webhook verification.
However, GitHub’s documentation demonstrates the pattern in Ruby. So I thought it’d be a good idea to translate it into Python in a more platform-agnostic manner. The core idea of the pattern goes as follows:
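As a sketch of that core idea, here’s a Python version using the standard library’s hmac module. The secret and payload values below are made up; GitHub sends the hex digest in the X-Hub-Signature-256 header with a sha256= prefix:

```python
import hashlib
import hmac


def verify_signature(payload: bytes, secret: bytes, header: str) -> bool:
    """Check a GitHub-style 'sha256=<hexdigest>' signature header."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time to thwart timing attacks.
    return hmac.compare_digest(expected, header)


secret = b"it's a secret to everybody"
payload = b'{"action": "opened"}'
good = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()

print(verify_signature(payload, secret, good))          # True
print(verify_signature(payload, secret, "sha256=bad"))  # False
```

Since both sides derive the signature from the shared secret and the raw request body, a tampered payload or a forged sender fails the comparison.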
Recipes from Python SQLite docs
While going through the documentation of Python’s sqlite3 module, I noticed that it’s quite API-driven, where different parts of the module are explained in a prescriptive manner. I, however, learn better from examples, recipes, and narratives. Although a few good recipes already exist in the docs, I thought I’d also list some of the examples I tried out while grokking them.
Executing individual statements
To execute individual statements, you’ll need to use the cursor_obj.execute(statement)
primitive.
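For instance, a minimal self-contained sketch using an in-memory database might look like this:

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE users (name TEXT, age INTEGER)")
# Pass values via '?' placeholders instead of string formatting to
# avoid SQL injection.
cur.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("foo", 30))
conn.commit()

cur.execute("SELECT name, age FROM users")
print(cur.fetchone())  # → ('foo', 30)
```

Each `execute` call runs exactly one statement; for many statements at once, the module offers separate primitives.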
Prefer urlsplit over urlparse to destructure URLs
TIL from this video by Anthony Sottile that Python’s urlparse is quite slow at parsing
URLs. I’ve always used urlparse to destructure URLs and didn’t know that there’s a faster
alternative to this in the standard library. The official documentation also recommends the
alternative function.
The urlparse function splits a supplied URL into its separate components and returns
a ParseResult object. Consider this example:
In [1]: from urllib.parse import urlparse
In [2]: url = "https://httpbin.org/get?q=hello&r=22"
In [3]: urlparse(url)
Out[3]: ParseResult(
scheme='https', netloc='httpbin.org',
path='/get', params='', query='q=hello&r=22',
fragment=''
)
You can see how the function disassembles the URL and builds a ParseResult object from the
URL components. Along with this, the urlparse function can also parse an obscure type of
URL that you’ll most likely never need. If you look closely at the previous example,
you’ll see that there’s a params field in the ParseResult object. This params
field gets parsed whether you need it or not, and that adds some overhead. The params
field will be populated if you have a URL like this:
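The excerpt cuts off before showing such a URL. For a concrete comparison, here’s a minimal sketch of the faster urlsplit, which returns a SplitResult that simply has no params field:

```python
from urllib.parse import urlsplit

url = "https://httpbin.org/get?q=hello&r=22"
parts = urlsplit(url)

# SplitResult is a 5-tuple: the path is left intact and the
# params-splitting step is skipped entirely, which is part of why
# urlsplit is faster than urlparse.
print(parts.scheme)  # → https
print(parts.path)    # → /get
print(parts.query)   # → q=hello&r=22
```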
ExitStack in Python
Over the years, I’ve used Python’s contextlib.ExitStack in a few interesting ways. The
official ExitStack documentation advertises it as a way to manage multiple context
managers and has a couple of examples of how to leverage it. However, neither in the docs
nor via GitHub code search could I find examples of some of the more unusual ways I’ve
used it in the past. So, I thought I’d document them here.
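As one illustration of the documented use case, here’s a minimal sketch that manages a variable number of files. The filenames are made up, and the files are created inline so the snippet runs on its own:

```python
from contextlib import ExitStack

filenames = ["a.txt", "b.txt", "c.txt"]

# Create the files first so the example is self-contained.
for name in filenames:
    with open(name, "w") as f:
        f.write(name + "\n")

# ExitStack guarantees that every file opened so far gets closed,
# even if a later open() raises partway through the list.
with ExitStack() as stack:
    files = [stack.enter_context(open(name)) for name in filenames]
    first_lines = [f.readline().strip() for f in files]

print(first_lines)  # → ['a.txt', 'b.txt', 'c.txt']
```

This works for any number of context managers decided at runtime, which a plain nested `with` statement can’t express.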
Compose multiple levels of fixtures in pytest
While reading the second version of Brian Okken’s pytest book, I came across this neat trick to compose multiple levels of fixtures. Suppose you want to create a fixture that returns some canned data from a database. Now, let’s say that invoking the fixture multiple times is expensive, and to avoid that you want to run it only once per test session. However, you still want to clear all the database state after each test function runs. Otherwise, a test might inadvertently get coupled with another test that runs before it via the fixture’s shared state. Let’s demonstrate this:
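The excerpt ends before the demonstration, so here’s a hedged sketch of the two-level pattern with hypothetical fixture names; in a real suite these would live in conftest.py:

```python
import pytest


@pytest.fixture(scope="session")
def db():
    # Expensive setup: runs only once per test session.
    return {"users": []}


@pytest.fixture
def clean_db(db):
    # Cheap per-test wrapper: hands out the shared session-scoped db,
    # then wipes its state so tests can't couple through it.
    yield db
    db["users"].clear()


def test_add_user(clean_db):
    clean_db["users"].append("foo")
    assert clean_db["users"] == ["foo"]


def test_state_was_cleared(clean_db):
    # Passes only because clean_db wiped the state after the last test.
    assert clean_db["users"] == []
```

The session-scoped fixture pays the setup cost once, while the function-scoped wrapper restores a clean slate after every test.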