Disallow large file download from URLs in Python

· 2 min

I was working on a DRF POST API endpoint where the consumer is expected to add a URL containing a PDF file and the system would then download the file and save it to an S3 bucket. While this sounds quite straightforward, there’s one big issue. Before I started working on it, the core logic looked like this:

# src.py
from __future__ import annotations

from urllib.request import urlopen
import tempfile
from shutil import copyfileobj


def save_to_s3(src_url: str, dest_url: str) -> None:
    with tempfile.NamedTemporaryFile() as file:
        with urlopen(src_url) as response:
            # This stdlib function saves the content of the file
            # in 'file'.
            copyfileobj(response, file)

        # Logic to save file in s3.
        _save_to_s3(dest_url)


if __name__ == "__main__":
    save_to_s3(
        "https://citeseerx.ist.psu.edu/viewdoc/download?"
        "doi=10.1.1.92.4846&rep=rep1&type=pdf",
        "https://s3-url.com",
    )

In the above snippet, there’s no guardrail against how large the target file can be. You could bring the entire server down to its knees by posting a link to a ginormous file. The server would be busy downloading the file and keep consuming resources.
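One way to add that guardrail is to reject early when the server declares an oversized Content-Length, then stream the body in chunks and abort once a byte budget is exceeded, since the header can be missing or wrong. A minimal sketch (the 20 MB cap, chunk size, and function name are mine, not the original endpoint's):

```python
from urllib.request import urlopen

MAX_BYTES = 20 * 1024 * 1024  # Arbitrary 20 MB budget.
CHUNK_SIZE = 64 * 1024  # Read the body 64 KB at a time.


def download_capped(src_url: str, max_bytes: int = MAX_BYTES) -> bytes:
    """Download at most 'max_bytes' from 'src_url' or raise ValueError."""
    with urlopen(src_url) as response:
        # Bail out early if the server admits the body is too big.
        declared = response.headers.get("Content-Length")
        if declared is not None and int(declared) > max_bytes:
            raise ValueError("remote file exceeds the allowed size")

        # The header can be absent or lie, so count bytes while streaming.
        buf = bytearray()
        while chunk := response.read(CHUNK_SIZE):
            buf.extend(chunk)
            if len(buf) > max_bytes:
                raise ValueError("remote file exceeds the allowed size")
        return bytes(buf)
```

The returned bytes can then be written to the NamedTemporaryFile from the snippet above before the S3 upload happens.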

Declaratively transform data class fields in Python

· 3 min

While writing microservices in Python, I like to declaratively define the shape of the data coming in and out of JSON APIs or NoSQL databases in a separate module. Both TypedDict and dataclass are fantastic tools to communicate the shape of the data with the next person working on the codebase.

Whenever I need to do some processing on the data before working with it, I prefer to transform it via dataclasses. Consider this example:
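A minimal sketch of what such a transformation can look like (the User shape and its fields are invented for illustration): declare the fields once and let __post_init__ normalize them on instantiation.

```python
from dataclasses import dataclass


@dataclass
class User:
    # Shape of a hypothetical payload coming out of a JSON API.
    name: str
    email: str

    def __post_init__(self) -> None:
        # Transform the fields right after the instance is built.
        self.name = self.name.strip().title()
        self.email = self.email.strip().lower()


user = User(name="  jane doe ", email=" JANE@EXAMPLE.COM ")
print(user)  # User(name='Jane Doe', email='jane@example.com')
```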

Caching connection objects in Python

· 2 min

To avoid instantiating multiple DB connections in Python apps, a common approach is to initialize the connection objects in a module once and then import them everywhere. So, you’d do this:

# src.py
import boto3  # pip install boto3
import redis  # pip install redis

dynamo_client = boto3.client("dynamodb")
redis_client = redis.Redis()

However, this adds import-time side effects to your module and can turn out to be expensive. In search of a better solution, my first instinct was to go for functools.lru_cache(None) to immortalize the connection objects in memory. It works like this:
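Roughly like the sketch below; sqlite3 stands in for the redis and boto3 clients here so the example stays dependency-free, but the pattern is identical: wrap the client constructor in a function and decorate it.

```python
import sqlite3
from functools import lru_cache


@lru_cache(maxsize=None)
def get_connection() -> sqlite3.Connection:
    # The body runs only on the first call; every subsequent call
    # returns the exact same cached connection object.
    return sqlite3.connect(":memory:")


print(get_connection() is get_connection())  # True
```

Nothing is created at import time; the connection comes into existence only when the first caller asks for it.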

How not to run a script in Python

· 2 min

When I first started working with Python, nothing stumped me more than how bizarre Python’s import system seemed to be. Oftentimes, I wanted to run a module inside a package with the python src/sub/module.py command, and it’d throw an ImportError that didn’t make any sense. Consider this package structure:

src
├── __init__.py
├── a.py
└── sub
    ├── __init__.py
    └── b.py

Let’s say you’re importing module a in module b:
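A self-contained repro of the failure, and the fix, can be sketched like this (the module contents are placeholders): running b.py directly puts src/sub, the script's own directory, at the front of sys.path instead of the project root, so the top-level src package is invisible; running it as a module with -m from the project root works.

```python
import os
import subprocess
import sys
import tempfile

# Recreate the package layout from above in a temp directory.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "src", "sub"))
for path in ("src/__init__.py", "src/sub/__init__.py"):
    open(os.path.join(root, path), "w").close()
with open(os.path.join(root, "src", "a.py"), "w") as f:
    f.write("GREETING = 'hello from a'\n")
with open(os.path.join(root, "src", "sub", "b.py"), "w") as f:
    f.write("from src.a import GREETING\nprint(GREETING)\n")

# Running the file directly fails: sys.path[0] is src/sub, so the
# top-level 'src' package can't be found.
direct = subprocess.run(
    [sys.executable, "src/sub/b.py"], cwd=root, capture_output=True, text=True
)
print("ModuleNotFoundError" in direct.stderr)  # True

# Running it as a module from the project root works.
as_module = subprocess.run(
    [sys.executable, "-m", "src.sub.b"], cwd=root, capture_output=True, text=True
)
print(as_module.stdout.strip())  # hello from a
```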

Mocking chained methods of datetime objects in Python

· 2 min

This is the 4th time in a row that I’ve wasted time figuring out how to mock out a function during testing that calls the chained methods of a datetime.datetime object in the function body. So I thought I’d document it here. Consider this function:

# src.py
from __future__ import annotations

import datetime


def get_utcnow_isoformat() -> str:
    """Get UTCnow as an isoformat compliant string."""
    return datetime.datetime.utcnow().isoformat()

How’d you test it? Mocking out datetime.datetime is tricky because of its immutable nature. Third-party libraries like freezegun make it easier to mock and test functions like the one above. However, it’s not too difficult to cover this simple case without any additional dependencies. Here’s one way to achieve the goal:
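A sketch of that approach using only unittest.mock from the stdlib: patch the datetime.datetime name and pin the result of the chained utcnow().isoformat() call on the mock. The function under test is repeated here so the sketch is self-contained.

```python
import datetime
from unittest import mock


def get_utcnow_isoformat() -> str:
    """Get UTC now as an isoformat-compliant string."""
    return datetime.datetime.utcnow().isoformat()


def test_get_utcnow_isoformat() -> None:
    frozen = "2022-01-01T00:00:00"
    # Replace the whole datetime.datetime class with a mock, then
    # configure the chained utcnow().isoformat() call on it.
    with mock.patch("datetime.datetime") as mock_datetime:
        mock_datetime.utcnow.return_value.isoformat.return_value = frozen
        assert get_utcnow_isoformat() == frozen


test_get_utcnow_isoformat()
```

Because a Mock auto-creates attributes, utcnow.return_value.isoformat.return_value is all it takes to stub the entire chain; no subclassing of the immutable datetime type is needed.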

Declarative payloads with TypedDict in Python

· 6 min

While working with microservices in Python, a common pattern I see is the use of dynamically filled dictionaries as payloads for REST APIs or message queues. To understand what I mean by this, consider the following example:

# src.py
from __future__ import annotations

import json
from typing import Any

import redis  # Do a pip install.


def get_payload() -> dict[str, Any]:
    """Get the 'zoo' payload containing animal names and attributes."""

    payload = {"name": "awesome_zoo", "animals": []}

    names = ("wolf", "snake", "ostrich")
    attributes = (
        {"family": "Canidae", "genus": "Canis", "is_mammal": True},
        {"family": "Viperidae", "genus": "Boas", "is_mammal": False},
    )
    for name, attr in zip(names, attributes):
        payload["animals"].append(  # type: ignore
            {"name": name, "attribute": attr},
        )
    return payload


def save_to_cache(payload: dict[str, Any]) -> None:
    # You'll need to spin up a Redis db before instantiating
    # a connection here.
    r = redis.Redis()
    print("Saving to cache...")
    r.set(f"zoo:{payload['name']}", json.dumps(payload))


if __name__ == "__main__":
    payload = get_payload()
    save_to_cache(payload)

Here, the get_payload function constructs and returns a dict that denotes a contrived payload containing the data of an imaginary zoo, and the save_to_cache function stores that payload in a Redis DB. To execute the above snippet, you’ll need to spin up a Redis database first. You can use Docker to do so. Install and configure Docker on your system and run:
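Something along these lines starts a throwaway local Redis on the default port (the image tag and container name are arbitrary choices, not from the original post):

```sh
docker run --rm --name dev-redis -d -p 6379:6379 redis:latest
```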

Parametrized fixtures in pytest

· 3 min

While most of my pytest fixtures don’t react to the dynamically passed values of function parameters, there have been situations where I’ve definitely felt the need for that. Consider this example:

# test_src.py

import pytest


@pytest.fixture
def create_file(tmp_path):
    """Fixture to create a file in the tmp_path/tmp directory."""

    directory = tmp_path / "tmp"
    directory.mkdir()
    file = directory / "foo.md"  # The filename is hardcoded here!
    yield directory, file


def test_file_creation(create_file):
    """Check the fixture."""

    directory, file = create_file
    assert directory.name == "tmp"
    assert file.name == "foo.md"

Here, in the create_file fixture, I’ve created a file named foo.md in the tmp folder. Notice that the name of the file foo.md is hardcoded inside the body of the fixture function. The fixture yields the path of the directory and the created file.

Modify iterables while iterating in Python

· 4 min

If you try to mutate a sequence while traversing through it, Python usually doesn’t complain. For example:

# src.py

l = [3, 4, 56, 7, 10, 9, 6, 5]

for i in l:
    if not i % 2 == 0:
        continue
    l.remove(i)

print(l)

The above snippet iterates through a list of numbers and modifies the list l in place to remove any even number. However, running the script prints out this:

[3, 56, 7, 9, 5]

Wait a minute! The output doesn’t look correct. The final list still contains 56 which is an even number. Why did it get skipped? Printing the members of the list while the for-loop advances reveals what’s happening inside:
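Adding a print inside the loop shows the list shrinking under the loop's cursor: each remove() shifts the remaining elements one slot to the left while the loop's hidden index still advances, so the element right after every removed one (56, then 9) never gets visited. Iterating over a shallow copy is one way to fix it.

```python
l = [3, 4, 56, 7, 10, 9, 6, 5]

for i in l:
    # Show which element the loop sees and the list's current state.
    print(f"i = {i}, l = {l}")
    if not i % 2 == 0:
        continue
    l.remove(i)

print(l)  # [3, 56, 7, 9, 5]

# Fix: iterate over a shallow copy so mutating 'l' can't shift
# elements under the loop's cursor.
l = [3, 4, 56, 7, 10, 9, 6, 5]
for i in l[:]:
    if i % 2 == 0:
        l.remove(i)

print(l)  # [3, 7, 9, 5]
```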

GitHub Actions template for Python-based projects

· 2 min

Five traits that almost all the GitHub Action workflows in my Python projects share are:

  • If a new workflow is triggered while the previous one is still running, the previous one gets canceled.
  • The CI is triggered every day at 01:00 UTC.
  • Tests and lint checkers are run on Ubuntu and macOS against multiple Python versions.
  • Pip dependencies are cached.
  • Dependencies, including the Actions’ dependencies, are automatically updated via Dependabot.

I use pip-tools for managing dependencies in applications and the setuptools setup.py combo for managing dependencies in libraries. Here’s an annotated version of the template action syntax:
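A condensed sketch of such a workflow follows; the action versions and the test commands are illustrative stand-ins, not the exact template, and Dependabot itself is configured separately in .github/dependabot.yml:

```yaml
name: CI

on:
  push:
  schedule:
    - cron: "0 1 * * *"  # Run every day at 01:00 UTC.

# Cancel the in-flight run when a newer one is triggered.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
        python-version: ["3.9", "3.10", "3.11"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: pip  # Cache the pip dependencies between runs.
      - run: pip install -r requirements.txt
      - run: make lint && make test  # Placeholder commands.
```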

Self type in Python

· 4 min

PEP 673 introduces the Self type, and it’s coming to Python 3.11. However, you can already use it today via the typing_extensions module.

The Self type makes annotating methods that return the instances of the corresponding classes trivial. Before this, you’d have to do some mental gymnastics to statically type situations as follows:

# src.py
from __future__ import annotations

from typing import Any


class Animal:
    def __init__(self, name: str, says: str) -> None:
        self.name = name
        self.says = says

    @classmethod
    def from_description(cls, description: str = "|") -> Animal:
        descr = description.split("|")
        return cls(descr[0], descr[1])


class Dog(Animal):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)

    @property
    def legs(self) -> int:
        return 4


if __name__ == "__main__":
    dog = Dog.from_description("Matt | woof")
    print(dog.legs)  # Mypy complains here!

The class Animal has a from_description class method that acts as an additional constructor. It takes a description string, then builds and returns an instance of the class. The return type of the method is annotated as Animal here. However, doing this makes the child class Dog conflate its identity with the Animal class: for Dog, Mypy now infers that from_description returns an Animal, not a Dog. Executing the snippet raises no runtime error, but Mypy complains about the type:
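Swapping the hardcoded Animal return annotation for Self makes the annotation follow whichever subclass calls the constructor. A sketch of the fixed version; the try/except import covers Python 3.11+, where Self lives in typing, and older versions with typing_extensions installed:

```python
# src.py
from __future__ import annotations

try:  # Python 3.11+.
    from typing import Self
except ImportError:  # Older versions: pip install typing_extensions.
    from typing_extensions import Self


class Animal:
    def __init__(self, name: str, says: str) -> None:
        self.name = name
        self.says = says

    @classmethod
    def from_description(cls, description: str = "|") -> Self:
        # 'Self' binds to whichever class invoked the constructor,
        # so Dog.from_description(...) is typed as returning a Dog.
        descr = description.split("|")
        return cls(descr[0], descr[1])


class Dog(Animal):
    @property
    def legs(self) -> int:
        return 4


if __name__ == "__main__":
    dog = Dog.from_description("Matt|woof")
    print(dog.legs)  # Mypy now knows 'dog' is a 'Dog'.
```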