Pausing and resuming a socket server in Python

I needed to write a socket server in Python that would allow me to intermittently pause the server loop for a while, run something else, then get back to the previous request-handling phase; repeating this iteration until the heat death of the universe. Initially, I opted for the low-level socket module to write something quick and dirty. However, the implementation got hairy pretty quickly. While the socket module gives you plenty of control over how you can tune the server’s behavior, writing a server with robust signal and error handling can be quite a bit of boilerplate work....

February 5, 2023

Debugging a containerized Django application in Jupyter Notebook

Back in the days when I was working as a data analyst, I used to spend hours inside Jupyter notebooks exploring, wrangling, and plotting data to gain insights. However, as I shifted my career gear towards backend software development, my usage of interactive exploratory tools dwindled. Nowadays, I spend the majority of my time working on a fairly large Django monolith accompanied by a fleet of microservices. Although I love my text editor and terminal emulators, I miss the ability to just start a Jupyter Notebook server and run code snippets interactively....

January 14, 2023

Manipulating text with query expressions in Django

I was working with a table that had a similar (simplified) structure like this: | uuid | file_path | |----------------------------------|---------------------------| | b8658dfc3e80446c92f7303edf31dcbd | media/private/file_1.pdf | | 3d750874a9df47388569a23c559a4561 | media/private/file_2.csv | | d177b7f7d8b046768ab65857451a0354 | media/private/file_3.txt | | df45742175d7451dad59761f15653d9d | media/private/image_1.png | | a542966fc193470dab84351c15523042 | media/private/image_2.jpg | Let’s say the above table is represented by the following Django model: from django.db import models class FileCabinet(models.Model): uuid = models.UUIDField( primary_key=True, default=uuid.uuid4, editable=False ) file_path = models....

January 7, 2023

Using tqdm with concurrent.fututes in Python

At my workplace, I was writing a script to download multiple files from different S3 buckets. The script relied on Django ORM, so I couldn’t use Python’s async paradigm to speed up the process. Instead, I opted for boto3 to download the files and concurrent.futures.ThreadPoolExecutor to spin up multiple threads and make the requests concurrently. However, since the script was expected to be long-running, I needed to display progress bars to show the state of execution....

January 6, 2023

Faster bulk_update in Django

Django has a Model.objects.bulk_update method that allows you to update multiple objects in a single pass. While this method is a great way to speed up the update process, oftentimes it’s not fast enough. Recently, at my workplace, I found myself writing a script to update half a million user records and it was taking quite a bit of time to mutate them even after leveraging bulk update. So I wanted to see if I could use multiprocessing with ....

November 30, 2022

Installing Python on macOS with asdf

I’ve just migrated from Ubuntu to macOS for work and am still in the process of setting up the machine. I’ve been a lifelong Linux user and this is the first time I’ve picked up an OS that’s not just another flavor of Debian. Primarily, I work with Python, NodeJS, and a tiny bit of Go. Previously, any time I had to install these language runtimes, I’d execute a bespoke script that’d install:...

November 13, 2022

Save models with update_fields for better performance in Django

TIL that you can specify update_fields while saving a Django model to generate a leaner underlying SQL query. This yields better performance while updating multiple objects in a tight loop. To test that, I’m opening an IPython shell with python manage.py shell -i ipython command and creating a few user objects with the following lines: In [1]: from django.contrib.auth import User In [2]: for i in range(1000): ...: fname, lname = f'foo_{i}', f'bar_{i}' ....

November 9, 2022

Python logging quirks in AWS Lambda environment

At my workplace, while working on a Lambda1 function, I noticed that my Python logs weren’t appearing on the corresponding Cloudwatch2 log dashboard. At first, I thought that the function wasn’t picking up the correct log level from the environment variables. We were using serverless3 framework and GitLab CI to deploy the function, so my first line of investigation involved checking for missing environment variables in those config files. However, I quickly realized that the environment variables were being propagated to the Lambda function as expected....

October 20, 2022

Dissecting an outage caused by eager-loading file content

Python makes it freakishly easy to load the whole content of any file into memory and process it afterward. This is one of the first things that’s taught to people who are new to the language. While the following snippet might be frowned upon by many, it’s definitely not uncommon: # src.py with open("foo.csv", "r") as f: # Load the whole content of the file as a string in memory and return it....

October 14, 2022

Verifying webhook origin via payload hash signing

While working with GitHub webhooks, I discovered a common pattern1 a webhook receiver can adopt to verify that the incoming webhooks are indeed arriving from GitHub; not from some miscreant trying to carry out a man-in-the-middle attack. After some amount of digging, I found that it’s quite a common practice that many other webhook services employ as well. Also, check out how Sentry does it here2. Moreover, GitHub’s documentation demonstrates the pattern in Ruby....

September 18, 2022

Recipes from Python SQLite docs

While going through the documentation of Python’s sqlite31 module, I noticed that it’s quite API-driven, where different parts of the module are explained in a prescriptive manner. I, however, learn better from examples, recipes, and narratives. Although a few good recipes already exist in the docs, I thought I’d also enlist some of the examples I tried out while grokking them. Executing individual statements To execute individual statements, you’ll need to use the cursor_obj....

September 11, 2022

Prefer urlsplit over urlparse to destructure URLs

TIL from this1 video that Python’s urllib.parse.urlparse2 is quite slow at parsing URLs. I’ve always used urlparse to destructure URLs and didn’t know that there’s a faster alternative to this in the standard library. The official documentation also recommends the alternative function. The urlparse function splits a supplied URL into multiple seperate components and returns a ParseResult object. Consider this example: In [1]: from urllib.parse import urlparse In [2]: url = "https://httpbin....

September 10, 2022

ExitStack in Python

Over the years, I’ve used Python’s contextlib.ExitStack in a few interesting ways. The official documentation1 advertises it as a way to manage multiple context managers and has a couple of examples of how to leverage it. However, neither in the docs nor in GitHub code search2 I could find examples of some of the maybe unusual ways I’ve used it in the past. So, I thought I’d document them here....

August 27, 2022

Compose multiple levels of fixtures in pytest

While reading the second version of Brian Okken’s pytest book1, I came across this neat trick to compose multiple levels of fixtures. Suppose, you want to create a fixture that returns some canned data from a database. Now, let’s say that invoking the fixture multiple times is expensive, and to avoid that you want to run it only once per test session. However, you still want to clear all the database states after each test function runs....

July 21, 2022

Patch where the object is used

I was reading Ned Bachelder’s blog “Why your mock doesn’t work”1 and it triggered an epiphany in me about a testing pattern that I’ve been using for a while without being aware that there might be an aphorism on the practice. Patch where the object is used; not where it’s defined. To understand it, consider the example below. Here, you have a module containing a function that fetches data from some fictitious database....

July 18, 2022

Partially assert callable arguments with 'unittest.mock.ANY'

I just found out that you can use Python’s unittest.mock.ANY to make assertions about certain arguments in a mock call, without caring about the other arguments. This can be handy if you want to test how a callable is called but only want to make assertions about some arguments. Consider the following example: # test_src.py import random import time def fetch() -> list[float]: # Simulate fetching data from a database. time....

July 17, 2022

Apply constraints with 'assert' in Python

Whenever I need to apply some runtime constraints on a value while building an API, I usually compare the value to an expected range and raise a ValueError if it’s not within the range. For example, let’s define a function that throttles some fictitious operation. The throttle function limits the number of times an operation can be performed by specifying the throttle_after parameter. This parameter defines the number of iterations after which the operation will be halted....

July 10, 2022

Stream process a CSV file in Python

A common bottleneck for processing large data files is—memory. Downloading the file and loading the entire content is surely the easiest way to go. However, it’s likely that you’ll quickly hit OOM errors. Often time, whenever I have to deal with large data files that need to be downloaded and processed, I prefer to stream the content line by line and use multiple processes to consume them concurrently. For example, say, you have a CSV file containing millions of rows with the following structure:...

July 1, 2022

Bulk operations in Django with process pool

I’ve rarely been able to take advantage of Django’s bulk_create / bulk_update APIs in production applications; especially in the cases where I need to create or update multiple complex objects with a script. Often time, these complex objects trigger a chain of signals or need non-trivial setups before any operations can be performed on each of them. The issue is, bulk_create / bulk_update doesn’t trigger these signals or expose any hooks to run any setup code....

June 27, 2022

Read a CSV file from s3 without saving it to the disk

I frequently have to write ad-hoc scripts that download a CSV file from s31, do some processing on it, and then create or update objects in the production database using the parsed information from the file. In Python, it’s trivial to download any file from s3 via boto32, and then the file can be read with the csv module from the standard library. However, these scripts are usually run from a separate script server and I prefer not to clutter the server’s disk with random CSV files....

June 26, 2022