Python makes it freakishly easy to load the whole content of any file into memory and process it afterward. This is one of the first things that’s taught to people who are new to the language. While the following snippet might be frowned upon by many, it’s definitely not uncommon:

    # src.py

    with open("foo.csv", "r") as f:
        # Load the whole content of the file as a string in memory and return it.
        f_content = f.read()

        # ...do your processing here.
        ...

Adopting this pattern as the default way of handling files isn’t the most terrible thing in the world for sure. Also, this is often the preferred way of dealing with image files or blobs. However, overzealously loading file content is only okay as long as the file comfortably fits in the memory available to the working system.
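For text files that are processed one record at a time, a lazier alternative is to iterate over the file object, which yields one line at a time instead of materializing the whole content. Here is a minimal sketch of that idea. The count_data_rows helper and the throwaway demo file are made up for illustration:

```python
import os
import tempfile


def count_data_rows(path: str) -> int:
    """Count non-empty lines after the header without loading the whole file."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        next(f, None)  # Skip the header line, if there is one.
        for line in f:  # The file object yields one line at a time.
            if line.strip():
                count += 1
    return count


# Build a small throwaway CSV so the sketch can run on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("id,name\n1,foo\n2,bar\n\n3,baz\n")
    demo_path = tmp.name

row_count = count_data_rows(demo_path)
os.remove(demo_path)
print(row_count)
```

Only the current line has to live in memory at any point, so the memory footprint stays roughly flat regardless of the file size.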
Moreover, you’ll need to be extra careful if you’re accepting files from users and running further procedures on the content of those files. Indiscriminately loading the full content into memory can be dangerous, as it can cause OOM errors and crash the working process if the system runs out of memory while processing a large file. This simple oversight was the root cause of a major production incident at my workplace today.
The affected part of our primary Django monolith asks the users to upload a CSV file to a panel, runs some procedures on the content of the file, and displays the transformed rows in a paginated HTML table. Since the application is primarily used by authenticated users and we knew the expected file size, there wasn’t any guardrail that’d prevent someone from uploading a humongous file and bringing down the whole system. To make things worse, the associated background function in the Django view was buffering the entire file into memory before starting to process the rows. Buffering the entire file surely makes the process a little faster, but at the cost of higher memory usage.
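The kind of guardrail that was missing can be sketched with nothing but the standard library. This is an illustration rather than what we shipped: ensure_max_size, FileTooLargeError, and the 5 MiB limit are all hypothetical names and values, and the check assumes a seekable binary file object.

```python
import io
import os

MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # Hypothetical 5 MiB limit; tune to your workload.


class FileTooLargeError(ValueError):
    """Raised when an uploaded file exceeds the allowed size."""


def ensure_max_size(fileobj, max_bytes: int = MAX_UPLOAD_BYTES) -> int:
    """Return the size of a seekable binary file object, or raise if it is too big."""
    current = fileobj.tell()
    size = fileobj.seek(0, os.SEEK_END)  # seek() returns the new absolute position.
    fileobj.seek(current)  # Restore the original position for later readers.
    if size > max_bytes:
        raise FileTooLargeError(f"file is {size} bytes; limit is {max_bytes} bytes")
    return size


small = io.BytesIO(b"id,name\n1,foo\n")
print(ensure_max_size(small, max_bytes=1024))
```

In a Django app, a similar check could probably be done earlier, during form validation, against the size attribute of the uploaded file, so that oversized files are rejected before they ever reach a worker.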
Although we were using background processes to avoid chugging files in the main server
process, that didn’t help when the users suddenly started to send large CSV files in
parallel. The workers were hitting OOM errors and getting restarted by the process manager.
In our particular case, we didn’t have much reason to buffer the whole file before
processing. Apparently, the naive way scaled up pretty gracefully and we didn’t pay much
attention since no one was uploading files that our server instances couldn’t handle. We
were storing the incoming file in a models.FileField type attribute of a Django model. When
a user uploaded a CSV file, we’d:
- Open the file in binary mode via the file field’s open method.
- Buffer the whole file in memory and transform the binary content into a unicode string.
- Pass the stringified file-like object to csv.DictReader to load that as a CSV file.
- Apply transformation on the rows line by line and render the HTML table.
This is how the code looks:

    # src.py

    import csv
    import io

    # Django mandates us to open the file in binary mode.
    with model_instance.file.open(mode="rb") as f:
        # f.read() pulls the entire file into memory before any row is parsed.
        reader = csv.DictReader(
            io.StringIO(f.read().decode(errors="ignore", encoding="utf-8")),
        )

        for row in reader:
            # ... data processing goes here.
            ...

The csv.DictReader callable only accepts a file-like object that’s been opened in text
mode. However, Django’s FileField type doesn’t make any assumptions about the file
content. It mandates us to open the file in binary mode and then decode it if necessary. So,
we open the file in binary mode with model_instance.file.open(mode="rb") which returns an
io.BufferedReader type file object. This file-like object can’t be passed directly to
csv.DictReader because a binary stream yields undecoded bytes, and the CSV reader needs
decoded text to work out where a row ends. As a consequence, csv.DictReader expects a
file-like object opened in text mode where the rows are explicitly delineated by
platform-specific EOLs like \n or \r\n.
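To see the binary versus text mode distinction in action without Django, the sketch below feeds the same bytes to the csv module twice: once as a raw byte stream and once as decoded text. Here io.BytesIO is only a stand-in for the file object that the model field hands back, and the sample data is made up.

```python
import csv
import io

raw = b"id,name\n1,foo\n2,bar\n"

# A binary stream yields bytes, which the csv module refuses to parse.
try:
    next(csv.reader(io.BytesIO(raw)))
    binary_error = None
except csv.Error as exc:
    binary_error = str(exc)

# The same data exposed as a text stream parses fine.
text_rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))

print(binary_error)
print(text_rows)
```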
To solve this, we load the content of the file in memory with f.read() and decode it by
calling .decode() on the result of the preceding operation. Then we create an in-memory
text file-like buffer by passing the decoded string to io.StringIO. Now the CSV reader can
consume this transformed file-like object and build dictionaries of rows off of that.
Unfortunately, this stringified file buffer stays alive in memory throughout the entire
lifetime of the processor function. Imagine hundreds of large CSV files getting thrown at
the workers that execute the above code snippet. You see, at this point, overwhelming the
background workers doesn’t seem too difficult.
When our workers started to degrade in production and the alerts went bonkers, we began investigating the problem. After pinpointing the issue, we immediately responded to it by vertically scaling up the machines. The surface area of this issue was quite large and we didn’t want to hotfix it for fear of triggering inadvertent regressions. Once we were out of the woods, we started patching the culprit.
The solution to this is quite simple: convert the binary file-like object into a text
file-like object without buffering everything in memory and then pass the file to the CSV
reader. We were already processing the CSV rows in a lazy manner, so just removing
f.read() fixed the overzealous buffering issue. The corrected code snippet looks like
this:

    # src.py

    import csv
    import io

    # Django mandates us to open the file in binary mode.
    with model_instance.file.open(mode="rb") as f:
        # Wrap the binary stream so that it's decoded lazily, chunk by chunk.
        reader = csv.DictReader(
            io.TextIOWrapper(f, errors="ignore", encoding="utf-8"),
        )

        for row in reader:
            # ... data processing goes here.
            ...

io.TextIOWrapper wraps the binary file-like object in a way that makes it behave as
if it were opened in text mode. In fact, when you open a file in text mode, the native
open returns a file-like object wrapped in io.TextIOWrapper. You can find more details
about the implementation [1] of open in PEP-3116 [2].
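This relationship is easy to check on a regular file. The sketch below writes a throwaway CSV, opens it once in text mode and once in binary mode, and wraps the binary handle manually. The temporary file and variable names are only for illustration.

```python
import io
import os
import tempfile

with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,name\n1,foo\n")
    demo_path = tmp.name

# Text mode: the built-in open already hands back a TextIOWrapper.
with open(demo_path, "r", encoding="utf-8") as text_file:
    text_is_wrapper = isinstance(text_file, io.TextIOWrapper)
    lines_from_text = text_file.readlines()

# Binary mode: we get a buffered byte stream and add the text layer ourselves.
with open(demo_path, "rb") as binary_file:
    binary_is_buffered = isinstance(binary_file, io.BufferedReader)
    wrapped = io.TextIOWrapper(binary_file, encoding="utf-8")
    lines_from_wrapped = wrapped.readlines()
    wrapped.detach()  # Hand the binary file back so the with block closes it.

os.remove(demo_path)
print(text_is_wrapper, binary_is_buffered, lines_from_text == lines_from_wrapped)
```

The detach() call is optional here. Without it, closing or garbage-collecting the wrapper also closes the underlying binary file, which is worth knowing if you plan to reuse that file object afterward.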
The csv.DictReader callable can consume this transformed file-like object without any
further modifications. Since we aren’t calling f.read() anymore, no overzealous content
buffering is going on here and we can lazily ask for new rows from the reader object as we
sequentially process them.
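To get a rough feel for the difference, the two approaches can be compared with tracemalloc on a generated file. This is a sketch under a few assumptions, not a benchmark of our production code: a plain file opened with the built-in open stands in for the Django file field, the helper names and the 20,000-row demo file are made up, and the exact numbers will vary by Python version and platform.

```python
import csv
import io
import os
import tempfile
import tracemalloc


def process_eagerly(path: str) -> int:
    """Buffer the whole file, decode it, then parse (the original approach)."""
    with open(path, "rb") as f:
        content = f.read().decode("utf-8", errors="ignore")
        reader = csv.DictReader(io.StringIO(content))
        return sum(1 for _ in reader)


def process_lazily(path: str) -> int:
    """Decode on the fly through io.TextIOWrapper (the patched approach)."""
    with open(path, "rb") as f:
        # newline="" follows the csv module's recommendation for text streams.
        text = io.TextIOWrapper(f, encoding="utf-8", errors="ignore", newline="")
        count = sum(1 for _ in csv.DictReader(text))
        text.detach()  # Let the with block close the binary file.
        return count


def peak_bytes(func, path: str) -> tuple:
    """Return (row count, peak traced memory in bytes) for a single call."""
    tracemalloc.start()
    try:
        rows = func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return rows, peak


# Generate a throwaway CSV with 20,000 data rows.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "name", "payload"])
    for i in range(20_000):
        writer.writerow([i, f"name-{i}", "x" * 32])
    demo_path = tmp.name

eager_rows, eager_peak = peak_bytes(process_eagerly, demo_path)
lazy_rows, lazy_peak = peak_bytes(process_lazily, demo_path)
os.remove(demo_path)

print(f"eager: {eager_rows} rows, peak {eager_peak / 1024:.0f} KiB")
print(f"lazy:  {lazy_rows} rows, peak {lazy_peak / 1024:.0f} KiB")
```

The eager version has to hold the raw bytes, the decoded string, and the StringIO buffer at the same time, so its peak grows with the file size. The lazy version only keeps a small read buffer and the current row around.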