Skipping the first part of an iterable in Python

· 3 min

Consider this iterable:

it = (1, 2, 3, 0, 4, 5, 6, 7)

Let’s say you want to build another iterable that includes only the numbers from the element 0 onward. Usually, I’d do this:

# This returns (0, 4, 5, 6, 7).
from_zero = tuple(elem for idx, elem in enumerate(it) if idx >= it.index(0))

While this is quite terse and does the job, it won’t work with a generator since generators don’t have an .index method. There’s a more generic, and even terser, way to do the same thing with the itertools.dropwhile function. Here’s a sketch of how that might look:
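from itertools import dropwhile

it = (1, 2, 3, 0, 4, 5, 6, 7)

# dropwhile lazily discards items while the predicate holds, then yields
# the rest untouched; unlike .index, this works on generators too.
from_zero = tuple(dropwhile(lambda x: x != 0, it))
print(from_zero)  # (0, 4, 5, 6, 7)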

Pausing and resuming a socket server in Python

· 5 min

I needed to write a socket server in Python that would allow me to intermittently pause the server loop for a while, run something else, then get back to the previous request-handling phase, repeating this cycle until the heat death of the universe. Initially, I opted for the low-level socket module to write something quick and dirty, but the implementation got hairy pretty quickly. While the socket module gives you plenty of control over how you can tune the server’s behavior, writing a server with robust signal and error handling involves quite a bit of boilerplate work.
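One way to sketch that pause-and-resume pattern is with the stdlib’s higher-level socketserver module; this isn’t necessarily the approach from the post, just a minimal illustration:

import socketserver


class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Echo whatever the client sends right back at it.
        data = self.request.recv(1024)
        self.request.sendall(data)


def run_side_task():
    # Hypothetical stand-in for whatever runs while the server is paused.
    print("doing something else for a bit")


with socketserver.TCPServer(("localhost", 9999), EchoHandler) as server:
    server.timeout = 1  # handle_request() gives up after 1s of inactivity
    while True:
        server.handle_request()  # serve at most one request, then return
        run_side_task()  # the "paused" phase of the loop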

Debugging a containerized Django application in Jupyter Notebook

· 5 min

Back in the days when I was working as a data analyst, I used to spend hours inside Jupyter notebooks exploring, wrangling, and plotting data to gain insights. However, as I shifted my career gear towards backend software development, my usage of interactive exploratory tools dwindled.

Nowadays, I spend the majority of my time working on a fairly large Django monolith accompanied by a fleet of microservices. Although I love my text editor and terminal emulators, I miss the ability to just start a Jupyter Notebook server and run code snippets interactively. While Django allows you to open up a shell environment and run code snippets interactively, it still isn’t as flexible as a notebook.

Manipulating text with query expressions in Django

· 4 min

I was working with a table that had a structure similar to this (simplified):

|               uuid               |         file_path         |
|----------------------------------|---------------------------|
| b8658dfc3e80446c92f7303edf31dcbd | media/private/file_1.pdf  |
| 3d750874a9df47388569a23c559a4561 | media/private/file_2.csv  |
| d177b7f7d8b046768ab65857451a0354 | media/private/file_3.txt  |
| df45742175d7451dad59761f15653d9d | media/private/image_1.png |
| a542966fc193470dab84351c15523042 | media/private/image_2.jpg |

Let’s say the above table is represented by the following Django model:

import uuid

from django.db import models


class FileCabinet(models.Model):
    uuid = models.UUIDField(
        primary_key=True, default=uuid.uuid4, editable=False
    )
    file_path = models.FileField(upload_to="files/")

I needed to extract the file names with their extensions from the file_path column and create new paths by adding the prefix dir/ before each file name. This would involve stripping everything before the file name from a file path and adding the prefix, resulting in a list of new file paths like this: ['dir/file_1.pdf', ..., 'dir/image_2.jpg'].
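Assuming Django 2.1+ (for Reverse), here’s a sketch of how the whole thing might be done in the database with query expressions; the exact expressions in the post may differ:

from django.db.models import Value
from django.db.models.functions import Concat, Reverse, Right, StrIndex

# Locate the last "/" by finding the first "/" in the reversed path,
# slice the bare file name off the right, then prepend the new prefix.
new_paths = FileCabinet.objects.annotate(
    new_path=Concat(
        Value("dir/"),
        Right("file_path", StrIndex(Reverse("file_path"), Value("/")) - 1),
    )
).values_list("new_path", flat=True)

print(list(new_paths))  # ['dir/file_1.pdf', ..., 'dir/image_2.jpg']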

Using tqdm with concurrent.futures in Python

· 3 min

At my workplace, I was writing a script to download multiple files from different S3 buckets. The script relied on the Django ORM, so I couldn’t use Python’s async paradigm to speed up the process. Instead, I opted for boto3 to download the files and concurrent.futures.ThreadPoolExecutor to spin up multiple threads and make the requests concurrently.

However, since the script was expected to be long-running, I needed to display progress bars to show the state of execution. It’s quite easy to do with tqdm when you’re just looping over a list of file paths and downloading the contents synchronously. Here’s a sketch with placeholder bucket and key names:
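import boto3
from tqdm import tqdm

s3 = boto3.client("s3")

# Hypothetical keys and bucket standing in for the real ones.
file_paths = ["file_1.pdf", "file_2.csv", "file_3.txt"]

# tqdm wraps any iterable and renders a progress bar as the loop advances.
for key in tqdm(file_paths):
    s3.download_file("some-bucket", key, key)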

Faster bulk_update in Django

· 4 min

Django has a Model.objects.bulk_update method that allows you to update multiple objects in a single pass. While this method is a great way to speed up the update process, oftentimes it’s not fast enough. Recently, at my workplace, I found myself writing a script to update half a million user records and it was taking quite a bit of time to mutate them even after leveraging bulk update. So I wanted to see if I could use multiprocessing with .bulk_update to quicken the process even more. Turns out, yep I can!
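A rough sketch of the idea, not the exact script from the post: chunk the primary keys and let each worker process run its own bulk_update. The example mutation and chunk size here are made up.

from concurrent.futures import ProcessPoolExecutor

from django import db
from django.contrib.auth.models import User


def update_chunk(user_ids):
    users = list(User.objects.filter(id__in=user_ids))
    for user in users:
        user.last_name = user.last_name.upper()  # example mutation
    User.objects.bulk_update(users, ["last_name"])


def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i : i + size]


if __name__ == "__main__":
    ids = list(User.objects.values_list("id", flat=True))

    # Forked workers must not share the parent's DB connection; close it
    # first and let each worker lazily open its own. Assumes the default
    # fork start method so the workers inherit the Django setup.
    db.connections.close_all()
    with ProcessPoolExecutor() as pool:
        list(pool.map(update_chunk, chunked(ids, 10_000)))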

Installing Python on macOS with asdf

· 3 min

I’ve just migrated from Ubuntu to macOS for work and am still in the process of setting up the machine. I’ve been a lifelong Linux user and this is the first time I’ve picked up an OS that’s not just another flavor of Debian. Primarily, I work with Python, NodeJS, and a tiny bit of Go. Previously, any time I had to install these language runtimes, I’d execute a bespoke script that’d install:

Save models with update_fields for better performance in Django

· 3 min

TIL that you can specify update_fields while saving a Django model to generate a leaner underlying SQL query. This yields better performance when updating multiple objects in a tight loop. To test that, I’m opening an IPython shell with the python manage.py shell -i ipython command and creating a few user objects with the following lines:

In [1]: from django.contrib.auth.models import User

In [2]: for i in range(1000):
   ...:     fname, lname = f'foo_{i}', f'bar_{i}'
   ...:     User.objects.create(
   ...:         first_name=fname, last_name=lname, username=fname)
   ...:

Here’s a way to see the underlying query Django generates when you’re trying to save a single object; connection.queries records the raw SQL as long as DEBUG = True:
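In [3]: from django.db import connection

In [4]: user = User.objects.first()

In [5]: user.last_name = 'baz_0'

In [6]: user.save()  # plain save: every column gets rewritten

In [7]: connection.queries[-1]['sql']  # UPDATE sets *all* columns

In [8]: user.save(update_fields=['last_name'])  # leaner save

In [9]: connection.queries[-1]['sql']  # UPDATE sets "last_name" only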

Python logging quirks in AWS Lambda environment

· 4 min

At my workplace, while working on a Lambda function, I noticed that my Python logs weren’t appearing on the corresponding CloudWatch log dashboard. At first, I thought that the function wasn’t picking up the correct log level from the environment variables. We were using the Serverless framework and GitLab CI to deploy the function, so my first line of investigation involved checking for missing environment variables in those config files.

However, I quickly realized that the environment variables were being propagated to the Lambda function as expected, so the issue had to be coming from somewhere else. After perusing some docs and the source code of the Lambda Python Runtime Interface Client, I discovered that the AWS Lambda Python runtime pre-configures a logging handler that modifies the format of the log message and adds some metadata to the record if available. What’s not pre-configured, though, is the log level. This means that unless you set the level yourself, anything you log below the default WARNING threshold won’t print at all.
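The usual fix is to set the level on the pre-configured root logger yourself. A minimal sketch, assuming the level arrives through a hypothetical LOG_LEVEL environment variable:

import logging
import os

# The Lambda runtime has already attached a handler to the root logger;
# only the level needs to be set.
logger = logging.getLogger()
logger.setLevel(os.environ.get("LOG_LEVEL", "INFO"))


def handler(event, context):
    logger.info("This now shows up in the CloudWatch logs.")
    return {"statusCode": 200}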

Dissecting an outage caused by eager-loading file content

· 5 min

Python makes it freakishly easy to load the whole content of any file into memory and process it afterward. This is one of the first things that’s taught to people who are new to the language. While the following snippet might be frowned upon by many, it’s definitely not uncommon:

# src.py

with open("foo.csv", "r") as f:
    # Load the whole content of the file into memory as a string.
    f_content = f.read()

    # ...do your processing here.
    ...

Adopting this pattern as the default way of handling files isn’t the most terrible thing in the world for sure. Also, this is often the preferred way of dealing with image files or blobs. However, overzealously loading file content is only okay as long as the file is smaller than the available memory of the machine doing the work.
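The lazier alternative is to lean on the file object’s own iterator, which reads one line at a time. A minimal sketch:

# src.py

with open("foo.csv", "r") as f:
    # The file object yields one line per iteration, so memory usage
    # stays flat no matter how large the file gets.
    for line in f:
        ...  # process a single line here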