Read a CSV file from S3 without saving it to disk

· 4 min

I frequently have to write ad-hoc scripts that download a CSV file from AWS S3, do some processing on it, and then create or update objects in the production database using the parsed information from the file. In Python, it's trivial to download any file from S3 via boto3, and the file can then be read with the csv module from the standard library. However, these scripts usually run on a separate script server, and I prefer not to clutter the server's disk with random CSV files. Loading the S3 file directly into memory and reading its contents isn't difficult, but the process has some subtleties. I do this often enough to justify documenting the workflow here.
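The core of the workflow can be sketched like this: wrap the object's binary stream in io.TextIOWrapper so the csv module can consume it directly, and nothing ever touches disk. The bucket and key names are placeholders.

```python
import csv
import io


def parse_csv_stream(stream) -> list[dict]:
    # Works on any binary file-like object, including the streaming
    # body that boto3's get_object returns.
    text = io.TextIOWrapper(stream, encoding="utf-8")
    return list(csv.DictReader(text))


# With boto3 (placeholder bucket/key), you'd feed it the object body:
# body = boto3.client("s3").get_object(Bucket="my-bucket", Key="data.csv")["Body"]
# rows = parse_csv_stream(body)
```

Since the function only needs a readable binary stream, it's easy to test locally with io.BytesIO before pointing it at a real bucket.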

Safer 'operator.itemgetter' in Python

· 6 min

Python’s operator.itemgetter is quite versatile. It works on pretty much any subscriptable object, from sequences to mappings, and lets you fetch elements from them. The following snippet shows how you can use it to sort a list of tuples by the first element of each tuple:

In [2]: from operator import itemgetter
   ...:
   ...: l = [(10, 9), (1, 3), (4, 8), (0, 55), (6, 7)]
   ...: l_sorted = sorted(l, key=itemgetter(0))

In [3]: l_sorted
Out[3]: [(0, 55), (1, 3), (4, 8), (6, 7), (10, 9)]

Here, the itemgetter callable does the work of selecting the first element of every tuple inside the list, and the sorted function then uses those values to order the elements. This is also faster than passing a lambda function to the key parameter to do the same sorting.
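The "safer" variant the title hints at can be sketched as a thin wrapper that falls back to a default instead of raising; the name and signature below are illustrative, not necessarily the post's own.

```python
from operator import itemgetter


def safe_itemgetter(*items, default=None):
    # Delegate to itemgetter, but return 'default' instead of raising
    # when the index or key is missing.
    getter = itemgetter(*items)

    def _get(obj):
        try:
            return getter(obj)
        except (IndexError, KeyError):
            return default

    return _get


get_first = safe_itemgetter(0)
print(get_first([10, 20]))  # 10
print(get_first([]))        # None, where itemgetter(0) would raise IndexError
```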

Guard clause and exhaustiveness checking

· 4 min

Nested conditionals suck. They’re hard to write and even harder to read. I’ve rarely regretted the time I’ve spent optimizing for the flattest conditional structure in my code. The following piece mimics the actions of a traffic signal:

// src.ts

enum Signal {
  YELLOW = "Yellow",
  RED = "Red",
  GREEN = "Green",
}

function processSignal(signal: Signal): void {
  if (signal === Signal.YELLOW) {
    console.log("Slow down!");
  } else {
    if (signal === Signal.RED) {
      console.log("Stop!");
    } else {
      if (signal === Signal.GREEN) {
        console.log("Go!");
      }
    }
  }
}

// Log
processSignal(Signal.YELLOW) // prints 'Slow down!'
processSignal(Signal.RED) // prints 'Stop!'

The snippet above suffers from two major issues:

Return JSON error payload instead of HTML text in DRF

· 3 min

At my workplace, we have a large Django monolith that powers the main website and doubles as the primary REST API server. We use Django REST Framework (DRF) to build and serve the API endpoints. This means that whenever there's an error, we have to return a different error format to the website and to API users, based on the incoming request's headers.

The default DRF configuration returns a JSON response when the system hits an HTTP 400 (bad request) error. However, the server returns an HTML error page to API users whenever an HTTP 403 (forbidden), HTTP 404 (not found), or HTTP 500 (internal server error) occurs. This is suboptimal; a JSON API should never return HTML text when something goes wrong. On the other hand, the website needs those errors rendered as HTML pages.
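The gist of a fix is content negotiation: inspect the request's Accept header and render the error accordingly. Here's a minimal, framework-free sketch of that decision; in DRF, the same logic would typically live in a custom exception handler wired up via the EXCEPTION_HANDLER setting.

```python
import json


def render_error(accept: str, status_code: int, detail: str) -> tuple[str, str]:
    # Return a (content_type, body) pair for an error response,
    # chosen by the request's Accept header.
    if "application/json" in accept:
        body = json.dumps({"status_code": status_code, "detail": detail})
        return "application/json", body
    # Browser requests still get an HTML error page.
    return "text/html", f"<h1>{status_code}</h1><p>{detail}</p>"
```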

Decoupling producers and consumers of iterables with generators in Python

· 5 min

Generators can help you decouple the production and consumption of iterables, making your code more readable and maintainable. I learned this trick a few years back from David Beazley's Generator Tricks for Systems Programmers slides. Consider this example:

# src.py
from __future__ import annotations

import time
from typing import NoReturn


def infinite_counter(start: int, step: int) -> NoReturn:
    i = start
    while True:
        time.sleep(1)  # Not to flood stdout
        print(i)
        i += step


infinite_counter(1, 2)
# Prints
# 1
# 3
# 5
# ...

Now, how would you decouple the print statement from infinite_counter? Since the function never returns, you can't collect the outputs in a container, return it, and print the elements in another function. You might be wondering why you'd even need to do this; I can think of two reasons.
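With a generator, the counter only produces values, and the caller decides what to do with them. A sketch of the decoupled version (the sleep is dropped since the consumer now controls pacing):

```python
import itertools
from typing import Iterator


def infinite_counter(start: int, step: int) -> Iterator[int]:
    # Production: yield values instead of printing them.
    i = start
    while True:
        yield i
        i += step


# Consumption: the caller prints, slices, or collects as it pleases.
first_three = list(itertools.islice(infinite_counter(1, 2), 3))
print(first_three)  # [1, 3, 5]
```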

Pre-allocated lists in Python

· 4 min

In CPython, a list stores pointers to its elements rather than the values of the elements themselves. This is evident from the struct that represents a list in CPython's C source:

// Fetched from CPython main branch. Removed comments for brevity.
typedef struct {
    PyObject_VAR_HEAD
    PyObject **ob_item; /* Pointer reference to the elements. */
    Py_ssize_t allocated;
} PyListObject;

Even an empty list is a full-fledged PyObject and occupies some memory:

from sys import getsizeof

l = []

print(getsizeof(l))

This returns:
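The exact number varies across CPython versions and platforms. The technique in the title, pre-allocation, means building the list at its final length up front and assigning by index, instead of growing it element by element with append. A minimal sketch:

```python
n = 5

# Grow-as-you-go: repeated appends may trigger reallocations
# of the underlying pointer array.
grown = []
for i in range(n):
    grown.append(i * i)

# Pre-allocated: one allocation up front, then index assignment.
pre = [None] * n
for i in range(n):
    pre[i] = i * i

print(pre)  # [0, 1, 4, 9, 16]
```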

Disallow large file download from URLs in Python

· 2 min

I was working on a DRF POST API endpoint where the consumer is expected to submit a URL pointing to a PDF file, and the system would then download the file and save it to an S3 bucket. While this sounds quite straightforward, there's one big issue. Before I started working on it, the core logic looked like this:

# src.py
from __future__ import annotations

from urllib.request import urlopen
import tempfile
from shutil import copyfileobj


def save_to_s3(src_url: str, dest_url: str) -> None:
    with tempfile.NamedTemporaryFile() as file:
        with urlopen(src_url) as response:
            # This stdlib function streams the contents of the
            # response into 'file'.
            copyfileobj(response, file)

        # Logic to save the file in s3.
        _save_to_s3(dest_url)


if __name__ == "__main__":
    save_to_s3(
        "https://citeseerx.ist.psu.edu/viewdoc/download?"
        "doi=10.1.1.92.4846&rep=rep1&type=pdf",
        "https://s3-url.com",
    )

In the above snippet, there's no guardrail against how large the target file can be. You could bring the entire server to its knees by posting a link to a ginormous file; the server would be busy downloading it and keep consuming resources.
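One guardrail, sketched below (not necessarily the post's exact fix), is to stream the response in chunks and abort as soon as a size cap is exceeded, rather than trusting the Content-Length header alone. The 10 MiB limit is an arbitrary assumption.

```python
import io

MAX_SIZE = 10 * 1024 * 1024  # Assumed 10 MiB cap.


def read_capped(response, max_size: int = MAX_SIZE) -> bytes:
    # 'response' is any binary file-like object, such as the one
    # urlopen returns.
    buf = io.BytesIO()
    total = 0
    while chunk := response.read(64 * 1024):
        total += len(chunk)
        if total > max_size:
            raise ValueError(f"Refusing to download more than {max_size} bytes")
        buf.write(chunk)
    return buf.getvalue()
```

Checking the running total per chunk matters because a malicious server can lie in (or omit) the Content-Length header.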

Declaratively transform data class fields in Python

· 3 min

While writing microservices in Python, I like to declaratively define the shape of the data coming in and out of JSON APIs or NoSQL databases in a separate module. Both TypedDict and dataclass are fantastic tools for communicating the shape of the data to the next person working on the codebase.

Whenever I need to do some processing on the data before working with it, I prefer to transform it via dataclasses. Consider this example:
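A minimal sketch of the idea, with hypothetical field names, using __post_init__ to transform fields right after initialization:

```python
from dataclasses import dataclass


@dataclass
class User:
    # Illustrative fields; not from the original post.
    name: str
    email: str

    def __post_init__(self) -> None:
        # Normalize the raw incoming values declaratively, in one place.
        self.name = self.name.strip().title()
        self.email = self.email.strip().lower()


user = User(name="  jane doe ", email=" Jane@Example.COM ")
print(user)  # User(name='Jane Doe', email='jane@example.com')
```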

Caching connection objects in Python

· 2 min

To avoid instantiating multiple DB connections in Python apps, a common approach is to initialize the connection objects in a module once and then import them everywhere. So, you’d do this:

# src.py
import boto3  # pip install boto3
import redis  # pip install redis

dynamo_client = boto3.client("dynamodb")
redis_client = redis.Redis()

However, this adds import-time side effects to your module, which can turn out to be expensive. In search of a better solution, my first instinct was to reach for functools.lru_cache(None) to immortalize the connection objects in memory. It works like this:
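The shape of that approach, with a stand-in constructor so the sketch stays self-contained (in real code the body would build the redis or boto3 client):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_connection() -> object:
    # Stand-in for an expensive constructor such as redis.Redis()
    # or boto3.client("dynamodb").
    return object()


# The constructor runs once, on first call; every later call
# returns the same cached instance, and nothing happens at import time.
assert get_connection() is get_connection()
```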

How not to run a script in Python

· 2 min

When I first started working with Python, nothing stumped me more than how bizarre Python's import system seemed to be. Oftentimes, I wanted to run a module inside a package with the python src/sub/module.py command, and it'd throw an ImportError that didn't make any sense. Consider this package structure:

src
├── __init__.py
├── a.py
└── sub
    ├── __init__.py
    └── b.py

Let’s say you’re importing module a in module b:
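The failure mode can be reproduced end to end (file contents assumed for illustration): running b.py directly puts src/sub, not the project root, on sys.path, so the absolute import of src fails; running it as a module from the root works.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the package layout from the tree above.
    pkg = Path(tmp) / "src"
    (pkg / "sub").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "a.py").write_text("VALUE = 42\n")
    (pkg / "sub" / "__init__.py").write_text("")
    (pkg / "sub" / "b.py").write_text("from src.a import VALUE\nprint(VALUE)\n")

    # Direct invocation: 'src' isn't importable, so this fails.
    direct = subprocess.run(
        [sys.executable, str(pkg / "sub" / "b.py")],
        capture_output=True, text=True,
    )
    # Module invocation from the project root: works as intended.
    as_module = subprocess.run(
        [sys.executable, "-m", "src.sub.b"],
        capture_output=True, text=True, cwd=tmp,
    )
```

The first run dies with ModuleNotFoundError, while the second prints 42, because python -m resolves the module against the current directory instead of the script's own folder.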