I frequently have to write ad-hoc scripts that download a CSV file from s31, do some
processing on it, and then create or update objects in the production database using the
parsed information from the file. In Python, it’s trivial to download any file from s3 via
boto32, and then the file can be read with the csv
module from the standard library.
However, these scripts are usually run from a separate script server and I prefer not to
clutter the server’s disk with random CSV files. Loading the s3 file directly into memory
and reading its contents isn’t difficult but the process has some subtleties. I do this
often enough to justify documenting the workflow here.
Along with boto3, we can leverage Python’s tempfile.NamedTemporaryFile
3 to directly
download the contents of the file to a temporary in-memory file. Afterward, we can do the
processing, create the objects in the DB, and delete the file once we’re done. The
NamedTemporaryFile
class can be used as a context manager and it’ll delete the file
automatically when the with
block ends.
This is quite straightforward with a simple gotcha. Here’s how you’d usually download a file from s3 and save that to a file-like object:
# src.py
import boto3
s3 = boto3.client("s3")
with open("FILE_NAME", "wb") as f:
s3.download_fileobj("BUCKET_NAME", "OBJECT_NAME", f)
Okay but the doc reminds us about this:
The
download_fileobj
method accepts a writeable file-like object. The file object must be opened in binary mode, not text mode.
Opening the file in binary mode is an issue. The CSV reader needs the file to be opened in text mode. This is not an issue when you download the file to disk since you can open the file again in text mode to feed it to the CSV reader. However, we’re trying to avoid saving the file to disk and opening that again in text mode. So, you can’t do this:
# src.py
import boto3
import tempfile
import csv
s3 = boto3.client("s3")
with tempfile.NamedTemporaryFile("wb") as f:
s3.download_fileobj("BUCKET_NAME", "OBJECT_NAME", f)
# The csv file. This will raise an error since csv.DictReader
# expects a file opened in text mode; not binary mode.
csv_reader = csv.DictReader(f)
for row in csv_reader:
# ... do processing
...
The above snippet won’t work because:
The file-like object is opened in binary mode but the
csv.DictReader
expects the file pointer to be opened in text mode. So, it’ll raise an error.Even if you fixed that, the CSV reader wouldn’t be able to read anything since the file currently only allows writing in binary mode, not reading.
Even if you fixed the second issue, the content of the CSV file would be empty. That’s because after boto3 downloads and saves the file to the file object, it sets the file handle to the end of the file. So loading the content from there would result in an empty file. Here’s how I fixed all three of these problems:
# src.py
import boto3
import tempfile
import csv
import io
BUCKET_NAME = "foo-bucket"
OBJECT_NAME = "foo-file.csv"
s3 = boto3.client("s3")
# 'w+b' allows both reading from and writing to the file.
with tempfile.NamedTemporaryFile("w+b") as f:
s3.download_fileobj(BUCKET_NAME, OBJECT_NAME, f)
# This sets the file handle back to the beginning of the file.
# Without this, the loaded file will show no content.
f.seek(0)
# Here, 'io.TextIOWrapper' is converting the binary content of
# the file to be compatible with text content.
csv_reader = csv.DictReader(
io.TextIOWrapper(f, encoding="utf-8"),
)
# Now you're good to go.
for row in csv_reader:
# ... do processing
...
You can see that the snippet first opens a temporary file in w+b
mode which allows both
binary read and write operations. Then it downloads the file from s3 and saves it to the
file-like object.
Once the download is finished, the file handle is placed at the bottom of the file. So,
we’ll need to call f.seek(0)
to place the handle at the beginning of the file; otherwise,
our read operation will yield no content. Also, since the currently opened file object only
allows binary read and write operations, we’ll need to convert it to a text file object
before passing it to the CSV reader. The io.TextIOWrapper
class does exactly that. Once
the file object is in text mode, we pass it to the CSV reader and do further processing.
Recent posts
- Hierarchical rate limiting with Redis sorted sets
- Dynamic shell variables
- Link blog in a static site
- Running only a single instance of a process
- Function types and single-method interfaces in Go
- SSH saga
- Injecting Pytest fixtures without cluttering test signatures
- Explicit method overriding with @typing.override
- Quicker startup with module-level __getattr__
- Docker mount revisited