Redowan's Reflections

Apply constraints with 'assert' in Python

Whenever I need to apply some runtime constraints on a value while building an API, I usually compare the value to an expected range and raise a ValueError if it’s not within the range. For example, let’s define a function that throttles some fictitious operation. The throttle function limits the number of times an operation can be performed by specifying the throttle_after parameter. This parameter defines the number of iterations after which the operation will be halted. The current_iter parameter tracks the current number of times the operation has been performed. Here’s the implementation: ...
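A minimal sketch of the ValueError-based range check described above. The exact signature and the halting rule (halt once current_iter reaches throttle_after) are assumptions made for illustration, not necessarily the post's implementation.

```python
def throttle(current_iter: int, throttle_after: int) -> bool:
    """Return True when the fictitious operation should be halted.

    Hypothetical signature: both parameters are plain integers.
    """
    # Runtime constraints: compare each value to its expected range
    # and raise a ValueError when it falls outside of it.
    if throttle_after < 1:
        raise ValueError("throttle_after must be a positive integer")
    if current_iter < 0:
        raise ValueError("current_iter must be non-negative")

    # Assumed rule: halt once the iteration count reaches the limit.
    return current_iter >= throttle_after


print(throttle(current_iter=0, throttle_after=3))
print(throttle(current_iter=3, throttle_after=3))
```

The same constraints could be expressed with assert statements, keeping in mind that asserts are stripped when Python runs with the -O flag.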

Automerge Dependabot PRs on GitHub

Whether I’m trying out a new tool or just prototyping with a familiar stack, I usually create a new project on GitHub and run all the experiments there. Some examples of these are:

    rubric: linter config initializer for Python
    exert: declaratively apply converter functions to class attributes
    hook-slinger: generic service to send, retry, and manage webhooks
    think-async: exploring cooperative concurrency primitives in Python
    epilog: container log aggregation with Elasticsearch, Kibana & Filebeat

While some of these prototypes become full-fledged projects, most end up being just one-time journeys. One common theme among all of these endeavors is that I always include instructions in the readme.md on how to get the project up and running, no matter how small it is. Also, I tend to configure a rudimentary CI pipeline that runs the linters and tests. GitHub Actions and Dependabot make it simple to configure a basic CI workflow. Dependabot keeps the dependencies fresh and makes pull requests automatically when there’s a new version of a dependency used in a project. ...

Stream process a CSV file in Python

A common bottleneck when processing large data files is memory. Downloading the file and loading its entire content is surely the easiest way to go. However, you’ll likely hit out-of-memory (OOM) errors quickly. Whenever I have to deal with large data files that need to be downloaded and processed, I prefer to stream the content line by line and use multiple processes to consume them concurrently. For example, say you have a CSV file containing millions of rows with the following structure: ...
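A rough sketch of the streaming idea, assuming the rows arrive as an iterable of text lines. Here an in-memory io.StringIO stands in for the downloaded stream, and the id and name columns are hypothetical. Rows are read lazily and grouped into small batches, so only one batch lives in memory at a time; each batch could then be handed to a worker process.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_batches(
    lines: Iterable[str], batch_size: int
) -> Iterator[List[Dict[str, str]]]:
    """Lazily parse CSV lines and yield them in fixed-size batches."""
    reader = csv.DictReader(lines)
    while True:
        # islice pulls at most batch_size rows from the lazy reader.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a large remote file; the columns are made up.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
batches = list(iter_batches(sample, batch_size=2))
print(batches)
```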

Bulk operations in Django with process pool

I’ve rarely been able to take advantage of Django’s bulk_create / bulk_update APIs in production applications, especially in cases where I need to create or update multiple complex objects with a script. Often, these complex objects trigger a chain of signals or need non-trivial setup before any operation can be performed on each of them. The issue is that bulk_create / bulk_update don’t trigger these signals or expose any hooks to run setup code. The Django docs mention these caveats in detail. Here are a few of them: ...

Read a CSV file from s3 without saving it to the disk

I frequently have to write ad-hoc scripts that download a CSV file from s3, do some processing on it, and then create or update objects in the production database using the parsed information from the file. In Python, it’s trivial to download any file from s3 via boto3, and then the file can be read with the csv module from the standard library. However, these scripts are usually run from a separate script server and I prefer not to clutter the server’s disk with random CSV files. Loading the s3 file directly into memory and reading its contents isn’t difficult but the process has some subtleties. I do this often enough to justify documenting the workflow here. ...
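One way the in-memory read might look, sketched under assumptions: the bytes are downloaded into an io.BytesIO buffer and then wrapped in a text layer for the csv module. To keep the example runnable without AWS credentials, the buffer is filled by hand; the commented lines show where boto3's download_fileobj would go, with a hypothetical bucket and key.

```python
import csv
import io
from typing import BinaryIO, Dict, Iterator


def read_csv_rows(
    binary_stream: BinaryIO, encoding: str = "utf-8"
) -> Iterator[Dict[str, str]]:
    """Decode a binary stream on the fly and yield CSV rows as dicts."""
    # The csv module needs text, so wrap the bytes in a decoding layer.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    yield from csv.DictReader(text_stream)


# With boto3 installed, the buffer could be filled like this instead
# (bucket and key names are hypothetical):
#   buffer = io.BytesIO()
#   boto3.client("s3").download_fileobj("my-bucket", "data.csv", buffer)
#   buffer.seek(0)  # Rewind before reading.
buffer = io.BytesIO(b"id,name\n1,a\n2,b\n")

rows = list(read_csv_rows(buffer))
print(rows)
```

Forgetting to rewind the buffer with seek(0) after the download is an easy mistake, since the reader would then start at the end of the data and yield nothing.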

Distil git logs attached to a single file

I run git log --oneline to list out the commit logs all the time. It prints out a compact view of the git history. Running the command in this repo gives me this:

    d9fad76 Publish blog on safer operator.itemgetter, closes #130
    0570997 Merge pull request #129 from rednafi/dependabot/...
    6967f73 Bump actions/setup-python from 3 to 4
    48c8634 Merge pull request #128 from rednafi/dependabot/pip/mypy-0.961
    5b7a7b0 Bump mypy from 0.960 to 0.961

However, there are times when I need to list out the commit logs that only represent the changes made to a particular file. Here’s the command that does exactly that. ...

Safer 'operator.itemgetter' in Python

Python’s operator.itemgetter is quite versatile. It works on pretty much any object that supports item access, such as sequences and map-like objects, and allows you to fetch elements from them. The following snippet shows how you can use it to sort a list of tuples by the first element of the tuple:

    In [2]: from operator import itemgetter
       ...:
       ...: l = [(10, 9), (1, 3), (4, 8), (0, 55), (6, 7)]
       ...: l_sorted = sorted(l, key=itemgetter(0))

    In [3]: l_sorted
    Out[3]: [(0, 55), (1, 3), (4, 8), (6, 7), (10, 9)]

Here, the itemgetter callable is doing the work of selecting the first element of every tuple inside the list and then the sorted function is using those values to sort the elements. Also, this is faster than using a lambda function and passing that to the key parameter to do the sorting: ...

Guard clause and exhaustiveness checking

Nested conditionals suck. They’re hard to write and even harder to read. I’ve rarely regretted the time I’ve spent optimizing for the flattest conditional structure in my code. The following piece mimics the actions of a traffic signal:

    // src.ts
    enum Signal {
      YELLOW = "Yellow",
      RED = "Red",
      GREEN = "Green",
    }

    function processSignal(signal: Signal): void {
      if (signal === Signal.YELLOW) {
        console.log("Slow down!");
      } else {
        if (signal === Signal.RED) {
          console.log("Stop!");
        } else {
          if (signal === Signal.GREEN) {
            console.log("Go!");
          }
        }
      }
    }

    // Log
    processSignal(Signal.YELLOW); // prints 'Slow down!'
    processSignal(Signal.RED); // prints 'Stop!'

The snippet above suffers from two major issues: ...

Health check a server with 'nohup $(cmd) &'

While working on a project with EdgeDB and FastAPI, I wanted to perform health checks against the FastAPI server in the GitHub CI. This would notify me about the working state of the application. The idea is to:

    1. Run the server in the background.
    2. Run the commands against the server that’ll denote that the app is in a working state.
    3. Perform cleanup.
    4. Exit with code 0 if the check is successful, else exit with code 1.

The following shell script demonstrates a similar workflow with a Python HTTP server. This script: ...

Return JSON error payload instead of HTML text in DRF

At my workplace, we have a large Django monolith that powers the main website and works as the primary REST API server at the same time. We use Django Rest Framework (DRF) to build and serve the API endpoints. This means that whenever there’s an error, we have to return different formats of error responses to the website and API users, based on the incoming request header. The default DRF configuration returns a JSON response when the system experiences an HTTP 400 (bad request) error. However, the server returns an HTML error page to the API users whenever HTTP 403 (forbidden), HTTP 404 (not found), or HTTP 500 (internal server error) occurs. This is suboptimal; JSON APIs should never return HTML text when something goes wrong. On the other hand, the website still needs those HTML error pages. ...

Decoupling producers and consumers of iterables with generators in Python

Generators can help you decouple the production and consumption of iterables, making your code more readable and maintainable. I learned this trick a few years back from David Beazley’s slides on generators. Consider this example:

    # src.py
    from __future__ import annotations

    import time
    from typing import NoReturn


    def infinite_counter(start: int, step: int) -> NoReturn:
        i = start
        while True:
            time.sleep(1)  # Not to flood stdout
            print(i)
            i += step


    infinite_counter(1, 2)
    # Prints
    # 1
    # 3
    # 5
    # ...

Now, how’d you decouple the print statement from the infinite_counter? Since the function never returns, you can’t collect the outputs in an iterable, return the container, and print the elements of the iterable in another function. You might be wondering why you would even need to do it. I can think of two reasons: ...
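As a sketch of the decoupling the title refers to, the counter can yield its values instead of printing them, which leaves the consumer free to decide what to do with each one. The take helper is a hypothetical consumer added for illustration, and the sleep call is dropped to keep the example quick.

```python
from itertools import islice
from typing import Iterator, List


def infinite_counter(start: int, step: int) -> Iterator[int]:
    """Producer: lazily yields values and knows nothing about printing."""
    i = start
    while True:
        yield i
        i += step


def take(counter: Iterator[int], n: int) -> List[int]:
    """Consumer: pulls only the first n values from the producer."""
    return list(islice(counter, n))


first_three = take(infinite_counter(1, 2), 3)
print(first_three)  # [1, 3, 5]
```

Because the producer is lazy, the consumer controls how many values are ever computed, even though the underlying loop never terminates on its own.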

Pre-allocated lists in Python

In CPython, a list stores pointers to its elements rather than the values of the elements themselves. This is evident from the struct that represents a list in C:

    // Fetched from CPython main branch. Removed comments for brevity.
    typedef struct {
        PyObject_VAR_HEAD
        PyObject **ob_item; /* Pointer reference to the element. */
        Py_ssize_t allocated;
    } PyListObject;

An empty list builds a PyObject and occupies some memory:

    from sys import getsizeof

    l = []
    print(getsizeof(l))

This returns: ...
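A small sketch of what pre-allocation might look like in practice, assuming CPython: one list is created at its final length with [None] * n and filled by index, while the other grows through append. The exact byte counts depend on the CPython version and platform, so none are quoted here.

```python
from sys import getsizeof

n = 1_000

# Pre-allocate every slot up front, then fill the slots by index.
preallocated = [None] * n
for i in range(n):
    preallocated[i] = i * i

# Grow the list one element at a time; CPython over-allocates
# occasionally so that repeated appends stay cheap.
appended = []
for i in range(n):
    appended.append(i * i)

# Sizes cover only the pointer array, not the integers it points to.
print(getsizeof(preallocated), getsizeof(appended))
```

Both lists end up with the same contents; the difference is in how much pointer storage the list object itself reserves.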

In favor of sentence case

Up until now, I’ve always preferred Title Case to demarcate titles and section headers in my writings. However, lately I’ve realized that each time I start writing a sentence, I waste a few seconds deciding on the appropriate case for special words (technical terms, trademark names, proper nouns, and so on) and how they’ll blend in with the multiple flavors of rules around title casing. Plus, special casing of selected words often makes title-cased sentences look strange. ...

Disallow large file download from URLs in Python

I was working on a DRF POST API endpoint where the consumer is expected to add a URL containing a PDF file and the system would then download the file and save it to an S3 bucket. While this sounds quite straightforward, there’s one big issue. Before I started working on it, the core logic looked like this:

    # src.py
    from __future__ import annotations

    from urllib.request import urlopen
    import tempfile
    from shutil import copyfileobj


    def save_to_s3(src_url: str, dest_url: str) -> None:
        with tempfile.NamedTemporaryFile() as file:
            with urlopen(src_url) as response:
                # This stdlib function saves the content of the file
                # in 'file'.
                copyfileobj(response, file)

            # Logic to save file in s3.
            _save_to_s3(dest_url)


    if __name__ == "__main__":
        save_to_s3(
            "https://citeseerx.ist.psu.edu/viewdoc/download?"
            "doi=10.1.1.92.4846&rep=rep1&type=pdf",
            "https://s3-url.com",
        )

In the above snippet, there’s no guardrail against how large the target file can be. You could bring the entire server down to its knees by posting a link to a ginormous file. The server would be busy downloading the file and keep consuming resources. ...
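One possible guardrail, sketched with the standard library only: copy the response in chunks and abort as soon as a byte budget is exceeded. The copy_with_limit helper and the FileTooLargeError exception are hypothetical names introduced for this illustration, and io.BytesIO objects stand in for the HTTP response and the temporary file.

```python
import io
from typing import BinaryIO


class FileTooLargeError(Exception):
    """Raised when the source exceeds the allowed size (hypothetical)."""


def copy_with_limit(
    src: BinaryIO, dest: BinaryIO, max_bytes: int, chunk_size: int = 64 * 1024
) -> int:
    """Copy src into dest chunk by chunk, failing past max_bytes."""
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return copied
        copied += len(chunk)
        # Stop early instead of downloading the whole oversized file.
        if copied > max_bytes:
            raise FileTooLargeError(f"source is larger than {max_bytes} bytes")
        dest.write(chunk)


dest = io.BytesIO()
copied = copy_with_limit(io.BytesIO(b"x" * 10), dest, max_bytes=10, chunk_size=4)
print(copied)  # 10
```

In the snippet above, copy_with_limit(response, file, max_bytes=...) could take the place of the copyfileobj call, since the object returned by urlopen also exposes read.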

Declaratively transform data class fields in Python

While writing microservices in Python, I like to declaratively define the shape of the data coming in and out of JSON APIs or NoSQL databases in a separate module. Both TypedDict and dataclass are fantastic tools to communicate the shape of the data with the next person working on the codebase. Whenever I need to do some processing on the data before starting to work on that, I prefer to transform the data via dataclasses. Consider this example: ...
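A minimal sketch of transforming fields through a dataclass, using the standard __post_init__ hook. The User class and its normalization rules are hypothetical and only illustrate the idea of cleaning data at construction time.

```python
from dataclasses import dataclass


@dataclass
class User:
    """Hypothetical payload shape with normalized fields."""

    name: str
    email: str

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so every instance
        # carries the cleaned-up values.
        self.name = self.name.strip().title()
        self.email = self.email.strip().lower()


user = User(name="  jane doe ", email=" Jane@Example.COM ")
print(user)
```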

Caching connection objects in Python

To avoid instantiating multiple DB connections in Python apps, a common approach is to initialize the connection objects in a module once and then import them everywhere. So, you’d do this:

    # src.py
    import boto3  # Pip install boto3
    import redis  # Pip install redis

    dynamo_client = boto3.client("dynamodb")
    redis_client = redis.Redis()

However, this adds import-time side effects to your module and can turn out to be expensive. In search of a better solution, my first instinct was to go for functools.lru_cache(None) to immortalize the connection objects in memory. It works like this: ...
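A sketch of the lru_cache idea mentioned above. To keep it runnable without a database, a hypothetical FakeConnection class stands in for a real client such as redis.Redis(); the instance counter exists only to show that the constructor runs once.

```python
from functools import lru_cache


class FakeConnection:
    """Stand-in for a real connection object (hypothetical)."""

    instances = 0

    def __init__(self) -> None:
        FakeConnection.instances += 1


@lru_cache(maxsize=None)
def get_connection() -> FakeConnection:
    # The body runs on the first call only; later calls return the
    # cached object, so nothing is created at import time.
    return FakeConnection()


first = get_connection()
second = get_connection()
print(first is second, FakeConnection.instances)
```

Compared with module-level clients, the connection here is created lazily on first use rather than as a side effect of importing the module.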

How not to run a script in Python

When I first started working with Python, nothing stumped me more than how bizarre Python’s import system seemed to be. Often, I wanted to run a module inside of a package with the python src/sub/module.py command, and it’d throw an ImportError that didn’t make any sense. Consider this package structure:

    src
    ├── __init__.py
    ├── a.py
    └── sub
        ├── __init__.py
        └── b.py

Let’s say you’re importing module a in module b: ...

Mocking chained methods of datetime objects in Python

This is the 4th time in a row that I’ve wasted time figuring out how to mock out a function during testing that calls the chained methods of a datetime.datetime object in the function body. So I thought I’d document it here. Consider this function:

    # src.py
    from __future__ import annotations

    import datetime


    def get_utcnow_isoformat() -> str:
        """Get UTCnow as an isoformat compliant string."""
        return datetime.datetime.utcnow().isoformat()

How’d you test it? Mocking out datetime.datetime is tricky because of its immutable nature. Third-party libraries like freezegun make it easier to mock and test functions like the one above. However, it’s not too difficult to cover this simple case without any additional dependencies. Here’s one way to achieve the goal: ...
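One dependency-free approach, sketched with unittest.mock from the standard library; it may differ from the one the post goes on to describe. The datetime class itself can’t have its methods replaced, but the datetime attribute on the datetime module can be swapped out for a mock, and the chained call is then configured through return_value. The timestamp string is an arbitrary example value.

```python
import datetime
from unittest.mock import patch


def get_utcnow_isoformat() -> str:
    """Get UTCnow as an isoformat compliant string."""
    return datetime.datetime.utcnow().isoformat()


def test_get_utcnow_isoformat() -> None:
    # Replace the 'datetime' class on the 'datetime' module with a mock
    # for the duration of the with block.
    with patch("datetime.datetime") as mock_datetime:
        # Configure the chain: utcnow() -> <mock>, <mock>.isoformat() -> str.
        mock_datetime.utcnow.return_value.isoformat.return_value = (
            "2022-03-13T00:00:00"
        )
        assert get_utcnow_isoformat() == "2022-03-13T00:00:00"
        mock_datetime.utcnow.assert_called_once_with()


test_get_utcnow_isoformat()
```

This works because the function looks up datetime.datetime at call time; code that did from datetime import datetime would need the patch applied where that name is used instead.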

Declarative payloads with TypedDict in Python

While working with microservices in Python, a common pattern that I see is the use of dynamically filled dictionaries as payloads of REST APIs or message queues. To understand what I mean by this, consider the following example:

    # src.py
    from __future__ import annotations

    import json
    from typing import Any

    import redis  # Do a pip install.


    def get_payload() -> dict[str, Any]:
        """Get the 'zoo' payload containing animal names and attributes."""

        payload = {"name": "awesome_zoo", "animals": []}

        names = ("wolf", "snake", "ostrich")
        attributes = (
            {"family": "Canidae", "genus": "Canis", "is_mammal": True},
            {"family": "Viperidae", "genus": "Boas", "is_mammal": False},
        )
        for name, attr in zip(names, attributes):
            payload["animals"].append(  # type: ignore
                {"name": name, "attribute": attr},
            )
        return payload


    def save_to_cache(payload: dict[str, Any]) -> None:
        # You'll need to spin up a Redis db before instantiating
        # a connection here.
        r = redis.Redis()
        print("Saving to cache...")
        r.set(f"zoo:{payload['name']}", json.dumps(payload))


    if __name__ == "__main__":
        payload = get_payload()
        save_to_cache(payload)

Here, the get_payload function constructs a payload that gets stored in a Redis DB in the save_to_cache function. The get_payload function returns a dict that denotes a contrived payload containing the data of an imaginary zoo. To execute the above snippet, you’ll need to spin up a Redis database first. You can use Docker to do so. Install and configure Docker on your system and run: ...
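As a sketch of where the title is heading, the shape of the zoo payload could be declared with TypedDict so that a type checker can verify the keys. The class names Attribute, Animal, and Zoo are hypothetical, and the Redis part is left out so the example runs on its own.

```python
from __future__ import annotations

from typing import List, Tuple, TypedDict


class Attribute(TypedDict):
    family: str
    genus: str
    is_mammal: bool


class Animal(TypedDict):
    name: str
    attribute: Attribute


class Zoo(TypedDict):
    name: str
    animals: List[Animal]


def get_payload() -> Zoo:
    """Build the 'zoo' payload with a declared shape."""
    payload: Zoo = {"name": "awesome_zoo", "animals": []}
    names = ("wolf", "snake")
    attributes: Tuple[Attribute, ...] = (
        {"family": "Canidae", "genus": "Canis", "is_mammal": True},
        {"family": "Viperidae", "genus": "Boas", "is_mammal": False},
    )
    for name, attr in zip(names, attributes):
        # No 'type: ignore' needed: the checker knows 'animals' is a list.
        payload["animals"].append({"name": name, "attribute": attr})
    return payload


payload = get_payload()
print(payload["animals"][0]["attribute"]["genus"])
```

At runtime the payload is still a plain dict, so it can be passed to json.dumps unchanged.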

Parametrized fixtures in pytest

While most of my pytest fixtures don’t react to the dynamically-passed values of function parameters, there have been situations where I’ve definitely felt the need for that. Consider this example:

    # test_src.py
    import pytest


    @pytest.fixture
    def create_file(tmp_path):
        """Fixture to create a file in the tmp_path/tmp directory."""

        directory = tmp_path / "tmp"
        directory.mkdir()
        file = directory / "foo.md"  # The filename is hardcoded here!
        yield directory, file


    def test_file_creation(create_file):
        """Check the fixture."""

        directory, file = create_file
        assert directory.name == "tmp"
        assert file.name == "foo.md"

Here, in the create_file fixture, I’ve created a file named foo.md in the tmp folder. Notice that the name of the file foo.md is hardcoded inside the body of the fixture function. The fixture yields the path of the directory and the created file. ...