Deduplicating iterables while preserving order in Python

Whenever I need to deduplicate the items of an iterable in Python, my usual approach is to create a set from the iterable and then convert it back into a list or tuple. However, this approach doesn’t preserve the original order of the items, which can be a problem if you need to keep the order unscathed. Here’s a naive approach that works: from __future__ import annotations from collections.abc import Iterable # Python >3.9 def dedup(it: Iterable) -> list: seen = set() result = [] for item in it: if item not in seen: seen.add(item) result.append(item) return result it = (2, 1, 3, 4, 66, 0, 1, 1, 1) deduped_it = dedup(it) # Gives you [2, 1, 3, 4, 66, 0] This code snippet defines a function dedup that takes an iterable it as input and returns a new list containing the unique items of the input iterable in their original order. The function uses a set seen to keep track of the items that have already been seen, and a list result to store the unique items. ...
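
As a point of comparison with the loop above, here is a minimal sketch of a shorter variant. It assumes every item is hashable and relies on dictionaries keeping insertion order (guaranteed since Python 3.7). The name `dedup_ordered` is made up for illustration.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterable


def dedup_ordered(it: Iterable[Hashable]) -> list:
    # Dictionary keys are unique and keep their insertion order,
    # so the first occurrence of every item survives, in order.
    return list(dict.fromkeys(it))


it = (2, 1, 3, 4, 66, 0, 1, 1, 1)
deduped_it = dedup_ordered(it)  # Gives you [2, 1, 3, 4, 66, 0]
```

Like the set-based version, this fails with a TypeError on unhashable items such as lists.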

May 1, 2023

Process substitution in Bash

I needed to compare two large directories with thousands of similarly named PDF files and find the differing filenames between them. In the first pass, this is what I did: Listed out the content of the first directory and saved it in a file: ls dir1 > dir1.txt Did the same for the second directory: ls dir2 > dir2.txt Compared the difference between the two outputs: diff dir1.txt dir2.txt This returned the names of the differing files like this: ...

April 30, 2023

Dynamic menu with select statement in Bash

Whenever I need to whip up a quick command line tool, my go-to is usually Python. Python’s CLI solutions tend to be more robust than their Shell counterparts. However, dealing with its portability can sometimes be a hassle, especially when all you want is to distribute a simple script. That’s why while toying around with argparse to create a dynamic menu, I decided to ask ChatGPT if there’s a way to achieve the same using native shell scripting. Delightfully, it introduced me to the dead-simple select command that I probably should’ve known about years ago. But I guess better late than never! Here’s what I was trying to accomplish: ...

April 29, 2023

Simple terminal text formatting with tput

When writing shell scripts, I’d often resort to using hardcoded ANSI escape codes to format text, such as: #!/usr/bin/env bash BOLD="\033[1m" UNBOLD="\033[22m" FG_RED="\033[31m" BG_YELLOW="\033[43m" BG_BLUE="\033[44m" RESET="\033[0m" # Print a message in bold red text on a yellow background. echo -e "${BOLD}${FG_RED}${BG_YELLOW}This is a warning message${RESET}" # Print a message in default-colored text on a blue background. echo -e "${BG_BLUE}This is a debug message${RESET}" The shell snippet above shows how to add text formatting and color to shell script output via ANSI escape codes. It defines a few variables that contain different escape codes for bold, unbold, foreground, and background colors. Then, we echo two log messages with different colors and formatting options. ...

April 23, 2023

Building a web app to display CSV file stats with ChatGPT & Observable

Whenever I plan to build something, I spend 90% of my time researching and figuring out the idiosyncrasies of the tools that I decide to use for the project. LLM tools like ChatGPT have helped me immensely in that regard. I’m taking on more tangential side projects because they’re no longer as time-consuming as they used to be and provide me with an immense amount of joy and learning opportunities. While LLM interfaces like ChatGPT may hallucinate, confabulate, and confidently give you misleading information, they also allow you to avoid starting from scratch when you decide to work on something. Personally, this benefits me enough to keep language models in my tool belt and use them to churn out more exploratory work at a much faster pace. ...

April 10, 2023

Pushing real-time updates to clients with Server-Sent Events (SSEs)

In multi-page web applications, a common workflow is where a user: Loads a specific page or clicks on some button that triggers a long-running task. On the server side, a background worker picks up the task and starts processing it asynchronously. The page shouldn’t reload while the task is running. The backend then communicates the status of the long-running task in real-time. Once the task is finished, the client needs to display a success or an error message depending on the final status of the finished task. The de facto tool for handling situations where real-time bidirectional communication is necessary is WebSocket. However, in the case above, you can see that the communication is mostly unidirectional: the client initiates some action on the server, and then the server continuously pushes data to the client during the lifespan of the background job. ...
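
To make the server-to-client push a bit more concrete, here is a minimal sketch of how a single Server-Sent Events message is framed on the wire: optional `event:` and one or more `data:` lines, terminated by a blank line. The helper name `format_sse` and the `progress` event name are made up for illustration; the web framework that would stream these strings with the `text/event-stream` content type is left out.

```python
from __future__ import annotations


def format_sse(data: str, event: str | None = None) -> str:
    """Frame a payload as one Server-Sent Events message."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Every line of a multi-line payload needs its own "data:" field.
    for chunk in data.splitlines() or [""]:
        lines.append(f"data: {chunk}")
    # A blank line marks the end of the message.
    return "\n".join(lines) + "\n\n"


message = format_sse("task 42% done", event="progress")
```

On the browser side, an `EventSource` listening for the `progress` event would receive the payload as `event.data`.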

April 8, 2023

Tinkering with Unix domain sockets

I’ve always had a vague idea about what Unix domain sockets are from my experience working with Docker for the past couple of years. However, lately, I’m spending more time in embedded edge environments and had to explore Unix domain sockets in a bit more detail. This is a rough documentation of what I’ve explored to gain some insights. The dry definition Unix domain sockets (UDS) are similar to TCP sockets in that they allow two processes to communicate with each other, but there are some core differences. While TCP sockets are used for communication over a network, Unix domain sockets are used for communication between processes running on the same computer. ...
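
As a rough illustration of the definition above, the sketch below binds a Unix domain socket to a filesystem path and exchanges one message over it. It assumes a Unix-like OS. For brevity, the server runs in a thread of the same process instead of a separate process, and the socket file name `demo.sock` is made up.

```python
import os
import socket
import tempfile
import threading


def run_echo_once(server_sock: socket.socket) -> None:
    # Accept a single client, upper-case whatever it sends, and reply.
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


with tempfile.TemporaryDirectory() as tmp_dir:
    # A UDS is addressed by a filesystem path, not by host and port.
    sock_path = os.path.join(tmp_dir, "demo.sock")

    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(sock_path)
        server.listen(1)

        worker = threading.Thread(target=run_echo_once, args=(server,))
        worker.start()

        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(sock_path)
            client.sendall(b"hello over uds")
            reply = client.recv(1024)

        worker.join()
```

Apart from the address family and the path-based address, the calls are the same ones you would use with a TCP socket.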

March 11, 2023

Signal handling in a multithreaded socket server

While working on a multithreaded socket server in an embedded environment, I realized that the default behavior of Python’s socketserver.ThreadingTCPServer requires some extra work if you want to shut down the server gracefully in the presence of an interruption signal. The intended behavior here is that whenever any of the SIGHUP, SIGINT, SIGTERM, or SIGQUIT signals is sent to the server, it should: Acknowledge the signal and log a message to the output console of the server. Notify all the connected clients that the server is going offline. Give the clients enough time (specified by a timeout parameter) to close the requests. Close all the client requests and then shut down the server after the timeout elapses. Here’s a quick implementation of a multithreaded echo server; let’s see what happens when you send SIGINT to shut down the server: ...

February 26, 2023

Switching between multiple data streams in a single thread

I was working on a project where I needed to poll multiple data sources and consume the incoming data points in a single thread. In this particular case, the two data streams were coming from two different Redis lists. The correct way to consume them would be to write two separate consumers and spin them up as different processes. However, in this scenario, I needed a simple way to poll and consume data from one data source, wait for a bit, then poll and consume from another data source, and keep doing this indefinitely. That way I could get away with doing the whole workflow in a single thread without the overhead of managing multiple processes. ...
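
The poll, wait, switch loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: in-memory deques stand in for the two Redis lists, and the names `poll`, `consume_round_robin`, `pause`, and `max_rounds` are made up (the real loop would run indefinitely).

```python
import time
from collections import deque
from itertools import cycle


def poll(source: deque):
    """Pop the oldest item from a source, or return None if it is empty."""
    return source.popleft() if source else None


def consume_round_robin(sources: dict, pause: float = 0.0, max_rounds=None) -> list:
    """Alternate between sources in one thread, pausing between polls."""
    consumed = []
    for round_no, (name, source) in enumerate(cycle(sources.items())):
        if max_rounds is not None and round_no >= max_rounds:
            break
        item = poll(source)
        if item is not None:
            consumed.append((name, item))
        time.sleep(pause)  # wait for a bit before switching sources
    return consumed


streams = {"stream_a": deque([1, 2, 3]), "stream_b": deque(["x", "y"])}
result = consume_round_robin(streams, max_rounds=6)
```

Each round polls exactly one source, so the two streams are consumed in an interleaved fashion without any extra threads or processes.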

February 19, 2023

Skipping the first part of an iterable in Python

Consider this iterable: it = (1, 2, 3, 0, 4, 5, 6, 7) Let’s say you want to build another iterable that includes only the numbers that appear starting from the element 0. Usually, I’d do this: # This returns (0, 4, 5, 6, 7). from_zero = tuple(elem for idx, elem in enumerate(it) if idx >= it.index(0)) While this is quite terse and does the job, it won’t work with a generator. There’s a more generic and terser way to do the same thing with the itertools.dropwhile function. Here’s how to do it: ...
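
For reference, a minimal sketch of the `itertools.dropwhile` approach could look like this. The generator function `numbers` is a made-up stand-in for any one-shot iterable that doesn't support `.index()`.

```python
from itertools import dropwhile


def numbers():
    # A generator: it can be consumed only once and has no .index() method.
    yield from (1, 2, 3, 0, 4, 5, 6, 7)


# Skip items while the predicate holds, then yield everything from the
# first failing item onward, including that item.
from_zero = tuple(dropwhile(lambda elem: elem != 0, numbers()))
# This returns (0, 4, 5, 6, 7).
```

Once the predicate fails for the first time, `dropwhile` stops checking it, so later items are passed through untouched.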

February 12, 2023

Pausing and resuming a socket server in Python

I needed to write a socket server in Python that would allow me to intermittently pause the server loop for a while, run something else, then get back to the previous request-handling phase; repeating this iteration until the heat death of the universe. Initially, I opted for the low-level socket module to write something quick and dirty. However, the implementation got hairy pretty quickly. While the socket module gives you plenty of control over how you can tune the server’s behavior, writing a server with robust signal and error handling can be quite a bit of boilerplate work. ...

February 5, 2023

Debugging a containerized Django application in Jupyter Notebook

Back in the days when I was working as a data analyst, I used to spend hours inside Jupyter notebooks exploring, wrangling, and plotting data to gain insights. However, as I shifted my career gear towards backend software development, my usage of interactive exploratory tools dwindled. Nowadays, I spend the majority of my time working on a fairly large Django monolith accompanied by a fleet of microservices. Although I love my text editor and terminal emulators, I miss the ability to just start a Jupyter Notebook server and run code snippets interactively. While Django allows you to open up a shell environment and run code snippets interactively, it still isn’t as flexible as a notebook. ...

January 14, 2023

Manipulating text with query expressions in Django

I was working with a table that had a (simplified) structure like this:

| uuid                             | file_path                 |
|----------------------------------|---------------------------|
| b8658dfc3e80446c92f7303edf31dcbd | media/private/file_1.pdf  |
| 3d750874a9df47388569a23c559a4561 | media/private/file_2.csv  |
| d177b7f7d8b046768ab65857451a0354 | media/private/file_3.txt  |
| df45742175d7451dad59761f15653d9d | media/private/image_1.png |
| a542966fc193470dab84351c15523042 | media/private/image_2.jpg |

Let’s say the above table is represented by the following Django model: import uuid from django.db import models class FileCabinet(models.Model): uuid = models.UUIDField( primary_key=True, default=uuid.uuid4, editable=False ) file_path = models.FileField(upload_to="files/") I needed to extract the file names with their extensions from the file_path column and create new paths by adding the prefix dir/ before each file name. This would involve stripping everything before the file name from a file path and adding the prefix, resulting in a list of new file paths like this: ['dir/file_1.pdf', ..., 'dir/image_2.jpg']. ...

January 7, 2023

Using tqdm with concurrent.futures in Python

At my workplace, I was writing a script to download multiple files from different S3 buckets. The script relied on Django ORM, so I couldn’t use Python’s async paradigm to speed up the process. Instead, I opted for boto3 to download the files and concurrent.futures.ThreadPoolExecutor to spin up multiple threads and make the requests concurrently. However, since the script was expected to be long-running, I needed to display progress bars to show the state of execution. It’s quite easy to do with tqdm when you’re just looping over a list of file paths and downloading the contents synchronously: ...

January 6, 2023

Colon command in shell scripts

The colon : command is a shell builtin that represents a truthy value. It can be thought of as an alias for the built-in true command. You can test it by opening a shell and typing a colon on the command line, like this: : If you then inspect the exit code by typing echo $? on the command line, you’ll see a 0 there, which is exactly what you’d see if you had used the true command. ...

December 23, 2022

Faster bulk_update in Django

Django has a Model.objects.bulk_update method that allows you to update multiple objects in a single pass. While this method is a great way to speed up the update process, oftentimes it’s not fast enough. Recently, at my workplace, I found myself writing a script to update half a million user records and it was taking quite a bit of time to mutate them even after leveraging bulk update. So I wanted to see if I could use multiprocessing with .bulk_update to quicken the process even more. Turns out, yep I can! ...

November 30, 2022

Installing Python on macOS with asdf

I’ve just migrated from Ubuntu to macOS for work and am still in the process of setting up the machine. I’ve been a lifelong Linux user and this is the first time I’ve picked up an OS that’s not just another flavor of Debian. Primarily, I work with Python, NodeJS, and a tiny bit of Go. Previously, any time I had to install these language runtimes, I’d execute a bespoke script that’d install: ...

November 13, 2022

Save models with update_fields for better performance in Django

TIL that you can specify update_fields while saving a Django model to generate a leaner underlying SQL query. This yields better performance while updating multiple objects in a tight loop. To test that, I’m opening an IPython shell with the python manage.py shell -i ipython command and creating a few user objects with the following lines: In [1]: from django.contrib.auth.models import User In [2]: for i in range(1000): ...: fname, lname = f'foo_{i}', f'bar_{i}' ...: User.objects.create( ...: first_name=fname, last_name=lname, username=f'{fname}-{lname}') ...: Here’s the underlying query Django generates when you’re trying to save a single object: ...

November 9, 2022

Python logging quirks in AWS Lambda environment

At my workplace, while working on a Lambda function, I noticed that my Python logs weren’t appearing on the corresponding CloudWatch log dashboard. At first, I thought that the function wasn’t picking up the correct log level from the environment variables. We were using the serverless framework and GitLab CI to deploy the function, so my first line of investigation involved checking for missing environment variables in those config files. However, I quickly realized that the environment variables were being propagated to the Lambda function as expected. So, the issue had to be coming from somewhere else. After perusing some docs, I discovered from the source code of the Lambda Python Runtime Interface Client that the AWS Lambda Python runtime pre-configures a logging handler that modifies the format of the log message, and also adds some metadata to the record if available. What’s not pre-configured though is the log level. This means that no matter the type of log message you try to send, it won’t print anything. ...
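
Since the handler is described as already attached while the level is not set, one possible remedy is to set the level on the root logger yourself. The sketch below is an assumption-laden illustration, not necessarily the fix the post arrives at: the `LOG_LEVEL` environment variable, the `configure_log_level` helper, and the `handler` entry point are all hypothetical names.

```python
import logging
import os


def configure_log_level(default: str = "INFO") -> logging.Logger:
    """Set the root logger's level from a (hypothetical) LOG_LEVEL env var."""
    level_name = os.environ.get("LOG_LEVEL", default).upper()
    root = logging.getLogger()
    # No handler is added here; only the level is changed, on the
    # assumption that the runtime has already attached its own handler.
    root.setLevel(level_name)
    return root


def handler(event, context):
    # Hypothetical Lambda entry point.
    configure_log_level()
    logging.info("event received")
    return {"ok": True}
```

Calling `logging.basicConfig(level=...)` instead is a no-op when the root logger already has handlers (unless `force=True` is passed), which is why the sketch sets the level on the root logger directly.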

October 20, 2022

Dissecting an outage caused by eager-loading file content

Python makes it freakishly easy to load the whole content of any file into memory and process it afterward. This is one of the first things that’s taught to people who are new to the language. While the following snippet might be frowned upon by many, it’s definitely not uncommon: # src.py with open("foo.csv", "r") as f: # Load the whole content of the file as a string in memory and return it. f_content = f.read() # ...do your processing here. ... Adopting this pattern as the default way of handling files isn’t the most terrible thing in the world for sure. Also, this is often the preferred way of dealing with image files or blobs. However, overzealously loading file content is only okay as long as the file size is smaller than the volatile memory of the working system. ...
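
For contrast with the eager `f.read()` pattern above, here is a minimal sketch of lazy, line-by-line processing, where memory use stays roughly bounded by the longest line instead of the file size. The helper `count_rows` and the sample CSV content are made up for illustration.

```python
import os
import tempfile


def count_rows(path: str) -> int:
    """Count lines without loading the whole file into memory."""
    rows = 0
    with open(path, "r") as f:
        # Iterating over the file object yields one line at a time.
        for _line in f:
            rows += 1
    return rows


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "foo.csv")
    with open(csv_path, "w") as f:
        f.write("id,name\n1,a\n2,b\n")
    row_count = count_rows(csv_path)
```

The same loop shape works for any per-line processing; only the body of the `for` loop changes.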

October 14, 2022