Redowan's Reflections

Debugging a containerized Django application in Jupyter Notebook

Back in the days when I was working as a data analyst, I used to spend hours inside Jupyter notebooks exploring, wrangling, and plotting data to gain insights. However, as I shifted my career gear towards backend software development, my usage of interactive exploratory tools dwindled. Nowadays, I spend the majority of my time working on a fairly large Django monolith accompanied by a fleet of microservices. Although I love my text editor and terminal emulators, I miss the ability to just start a Jupyter Notebook server and run code snippets interactively. While Django allows you to open up a shell environment and run code snippets interactively, it still isn’t as flexible as a notebook. ...

Manipulating text with query expressions in Django

I was working with a table that had a similar (simplified) structure like this: | uuid | file_path | |----------------------------------|---------------------------| | b8658dfc3e80446c92f7303edf31dcbd | media/private/file_1.pdf | | 3d750874a9df47388569a23c559a4561 | media/private/file_2.csv | | d177b7f7d8b046768ab65857451a0354 | media/private/file_3.txt | | df45742175d7451dad59761f15653d9d | media/private/image_1.png | | a542966fc193470dab84351c15523042 | media/private/image_2.jpg | Let’s say the above table is represented by the following Django model: from django.db import models class FileCabinet(models.Model): uuid = models.UUIDField( primary_key=True, default=uuid.uuid4, editable=False ) file_path = models.FileField(upload_to="files/") I needed to extract the file names with their extensions from the file_path column and create new paths by adding the prefix dir/ before each file name. This would involve stripping everything before the file name from a file path and adding the prefix, resulting in a list of new file paths like this: ['dir/file_1.pdf', ..., 'dir/image_2.jpg']. ...

Using tqdm with concurrent.fututes in Python

At my workplace, I was writing a script to download multiple files from different S3 buckets. The script relied on Django ORM, so I couldn’t use Python’s async paradigm to speed up the process. Instead, I opted for boto3 to download the files and concurrent.futures.ThreadPoolExecutor to spin up multiple threads and make the requests concurrently. However, since the script was expected to be long-running, I needed to display progress bars to show the state of execution. It’s quite easy to do with tqdm when you’re just looping over a list of file paths and downloading the contents synchronously: ...

Colon command in shell scripts

The colon : command is a shell utility that represents a truthy value. It can be thought of as an alias for the built-in true command. You can test it by opening a shell script and typing a colon on the command line, like this: : If you then inspect the exit code by typing $? on the command line, you’ll see a 0 there, which is exactly what you’d see if you had used the true command. ...

Faster bulk_update in Django

Django has a Model.objects.bulk_update method that allows you to update multiple objects in a single pass. While this method is a great way to speed up the update process, oftentimes it’s not fast enough. Recently, at my workplace, I found myself writing a script to update half a million user records and it was taking quite a bit of time to mutate them even after leveraging bulk update. So I wanted to see if I could use multiprocessing with .bulk_update to quicken the process even more. Turns out, yep I can! ...

Installing Python on macOS with asdf

I’ve just migrated from Ubuntu to macOS for work and am still in the process of setting up the machine. I’ve been a lifelong Linux user and this is the first time I’ve picked up an OS that’s not just another flavor of Debian. Primarily, I work with Python, NodeJS, and a tiny bit of Go. Previously, any time I had to install these language runtimes, I’d execute a bespoke script that’d install: ...

Save models with update_fields for better performance in Django

TIL that you can specify update_fields while saving a Django model to generate a leaner underlying SQL query. This yields better performance while updating multiple objects in a tight loop. To test that, I’m opening an IPython shell with python manage.py shell -i ipython command and creating a few user objects with the following lines: In [1]: from django.contrib.auth import User In [2]: for i in range(1000): ...: fname, lname = f'foo_{i}', f'bar_{i}' ...: User.objects.create( ...: first_name=fname, last_name=lname, username=f'{fname}-{lname}') ...: Here’s the underlying query Django generates when you’re trying to save a single object: ...

Python logging quirks in AWS Lambda environment

At my workplace, while working on a Lambda1 function, I noticed that my Python logs weren’t appearing on the corresponding Cloudwatch2 log dashboard. At first, I thought that the function wasn’t picking up the correct log level from the environment variables. We were using serverless3 framework and GitLab CI to deploy the function, so my first line of investigation involved checking for missing environment variables in those config files. However, I quickly realized that the environment variables were being propagated to the Lambda function as expected. So, the issue had to be coming from somewhere else. After perusing through some docs, I discovered from the source code of Lambda Python Runtime Interface Client4 that AWS Lambda Python runtime pre-configures5 a logging handler that modifies the format of the log message, and also adds some metadata to the record if available. What’s not pre-configured though is the log level. This means that no matter the type of log message you try to send, it won’t print anything. ...

Dissecting an outage caused by eager-loading file content

Python makes it freakishly easy to load the whole content of any file into memory and process it afterward. This is one of the first things that’s taught to people who are new to the language. While the following snippet might be frowned upon by many, it’s definitely not uncommon: # src.py with open("foo.csv", "r") as f: # Load the whole content of the file as a string in memory and return it. f_content = f.read() # ...do your processing here. ... Adopting this pattern as the default way of handling files isn’t the most terrible thing in the world for sure. Also, this is often the preferred way of dealing with image files or blobs. However, overzealously loading file content is only okay as long as the file size is smaller than the volatile memory of the working system. ...

Auditing commit messages on GitHub

After reading Simon Willison’s amazing piece1 on how he adds new features to his open-source softwares, I wanted to adopt some of the good practices and incorporate them into my own workflow. One of the highlights of that post was how to kick off a feature work. The process roughly goes like this: Opening a new GitHub issue for the feature in the corresponding repository. Adding a rough description of the feature to the issue. ...

To quote or not to quote

My grug1 brain can never remember the correct semantics of quoting commands and variables in a UNIX shell environment. Every time I work with a shell script or run some commands in a Docker compose file, I’ve to look up how to quote things properly to stop my ivory tower from crashing down. So, I thought I’d list out some of the most common rules that I usually look up all the time. ...

Returning values from a shell function

TIL that returning a value from a function in bash doesn’t do what I thought it does. Whenever you call a function that’s returning some value, instead of giving you the value, Bash sets the return value of the callee as the status code of the calling command. Consider this example: #!/usr/bin/bash # script.sh return_42() { return 42 } # Call the function and set the return value to a variable. value=$return_42 # Print the return value. echo $value I was expecting this to print out 42 but instead it doesn’t print anything to the console. Turns out, a shell function doesn’t return the value when it encounters the return keyword. Rather, it stops the execution of the function and sets the status code of the last command in the function as the value that the function returns. ...

Verifying webhook origin via payload hash signing

While working with GitHub webhooks, I discovered a common pattern1 a webhook receiver can adopt to verify that the incoming webhooks are indeed arriving from GitHub; not from some miscreant trying to carry out a man-in-the-middle attack. After some amount of digging, I found that it’s quite a common practice that many other webhook services employ as well. Also, check out how Sentry does it here2. Moreover, GitHub’s documentation demonstrates the pattern in Ruby. So I thought it’d be a good idea to translate that into Python in a more platform-agnostic manner. The core idea of the pattern goes as follows: ...

Recipes from Python SQLite docs

While going through the documentation of Python’s sqlite31 module, I noticed that it’s quite API-driven, where different parts of the module are explained in a prescriptive manner. I, however, learn better from examples, recipes, and narratives. Although a few good recipes already exist in the docs, I thought I’d also enlist some of the examples I tried out while grokking them. Executing individual statements To execute individual statements, you’ll need to use the cursor_obj.execute(statement) primitive. ...

Prefer urlsplit over urlparse to destructure URLs

TIL from this1 video that Python’s urllib.parse.urlparse2 is quite slow at parsing URLs. I’ve always used urlparse to destructure URLs and didn’t know that there’s a faster alternative to this in the standard library. The official documentation also recommends the alternative function. The urlparse function splits a supplied URL into multiple seperate components and returns a ParseResult object. Consider this example: In [1]: from urllib.parse import urlparse In [2]: url = "https://httpbin.org/get?q=hello&r=22" In [3]: urlparse(url) Out[3]: ParseResult( scheme='https', netloc='httpbin.org', path='/get', params='', query='q=hello&r=22', fragment='' ) You can see how the function disassembles the URL and builds a ParseResult object with the URL components. Along with this, the urlparse function can also parse an obscure type of URL that you’ll most likely never need. If you notice closely in the previous example, you’ll see that there’s a params argument in the ParseResult object. This params argument gets parsed whether you need it or not and that adds some overhead. The params field will be populated if you have a URL like this: ...

ExitStack in Python

Over the years, I’ve used Python’s contextlib.ExitStack in a few interesting ways. The official documentation1 advertises it as a way to manage multiple context managers and has a couple of examples of how to leverage it. However, neither in the docs nor in GitHub code search2 I could find examples of some of the maybe unusual ways I’ve used it in the past. So, I thought I’d document them here. ...

Compose multiple levels of fixtures in pytest

While reading the second version of Brian Okken’s pytest book1, I came across this neat trick to compose multiple levels of fixtures. Suppose, you want to create a fixture that returns some canned data from a database. Now, let’s say that invoking the fixture multiple times is expensive, and to avoid that you want to run it only once per test session. However, you still want to clear all the database states after each test function runs. Otherwise, a test might inadvertently get coupled with another test that runs before it via the fixture’s shared state. Let’s demonstrate this: ...

Patch where the object is used

I was reading Ned Bachelder’s blog “Why your mock doesn’t work”1 and it triggered an epiphany in me about a testing pattern that I’ve been using for a while without being aware that there might be an aphorism on the practice. Patch where the object is used; not where it’s defined. To understand it, consider the example below. Here, you have a module containing a function that fetches data from some fictitious database. ...

Partially assert callable arguments with 'unittest.mock.ANY'

I just found out that you can use Python’s unittest.mock.ANY to make assertions about certain arguments in a mock call, without caring about the other arguments. This can be handy if you want to test how a callable is called but only want to make assertions about some arguments. Consider the following example: # test_src.py import random import time def fetch() -> list[float]: # Simulate fetching data from a database. time.sleep(2) return [random.random() for _ in range(4)] def add(w: float, x: float, y: float, z: float) -> float: return w + x + y + z def procss() -> float: return add(*fetch()) Let’s say we only want to test the process function. But process ultimately depends on the fetch function, which has multiple side effects—it returns pseudo-random values and waits for 2 seconds on a fictitious network call. Since we only care about process, we’ll mock the other two functions. Here’s how unittest.mock.ANY can make life easier: ...

When to use 'git pull --rebase'

Whenever your local branch diverges from the remote branch, you can’t directly pull from the remote branch and merge it into the local branch. This can happen when, for example: You checkout from the main branch to work on a feature in a branch named alice. When you’re done, you merge alice into main. After that, if you try to pull the main branch from remote again and the content of the main branch changes by this time, you’ll encounter a merge error. Reproduce the issue Create a new branch named alice from main. Run: ...