Unless I’m hand rolling my own ORM-like feature or validation logic, I rarely need to write
custom descriptors in Python. The built-in descriptor magics like @classmethod
,
@property
, @staticmethod
, and vanilla instance methods usually get the job done.
However, every time I need to dig my teeth into descriptors, I reach for this fantastic how
to1 guide by Raymond Hettinger. You should definitely set aside the time to read it if
you haven’t already. It has helped me immensely to deepen my understanding of how many of
the fundamental language constructs are wired together underneath.
Descriptors are considered fairly advanced Python features and can easily turn into footguns if used carelessly. Recently, while working on an app with a descriptor-based data validator, I discovered a subtle but obvious bug that was hemorrhaging memory all across the app. The app was using a descriptor to validate class variables while simultaneously tracking instances where validation occurred. This validator was being used all over the codebase, so it slowly started blowing up memory usage in the background. The problem is that it was keeping hard references to everything it validated, so none of those objects could get garbage collected. But the really sneaky thing was how slowly and secretly the problem happened—the leakage built up bit by bit over time even when people used the validator in totally innocuous ways.
Here’s a simpler example of a validation descriptor that tracks the instances it’s applied to:
class Within:
# The instances are tracked here
_seen = {}
def __init__(self, min, max):
self.min = min
self.max = max
def __set_name__(self, instance, name):
self.name = name
def __get__(self, instance, instance_type):
return instance.__dict__[self.name]
def __set__(self, instance, value):
if not self.min <= value <= self.max:
raise ValueError(
f"{value} is not within {self.min} and {self.max}"
)
instance.__dict__[self.name] = value
# Track the instances that have been seen.
# This is the memory leak.
self._seen[instance] = value
The Within
descriptor validates that the values assigned to instance attributes are within
a specified min and max range. It does this by implementing the __set__
and __get__
dunder methods. When the descriptor is accessed via instance.attrname
, the __get__
method is called which returns the value from the instance’s dict. When a value is assigned
via instance.attrname = value
, the __set__
method is called which validates the value is
within the min/max bounds before setting it on the instance. A memory leak occurs because
the _seen
dict keeps a reference to every instance the descriptor has been accessed on.
This prevents the instances from being garbage collected even if there are no other
references to them. You can use the descriptor and observe the memory leakage like this:
import gc
class Exam:
math = Within(0, 100)
physics = Within(0, 100)
chemistry = Within(0, 100)
def __init__(self, math, physics, chemistry):
self.math = math
self.physics = physics
self.chemistry = chemistry
if __name__ == "__main__":
exam = Exam(30, 50, 40)
exam.math = 60
# Delete the exam instance
del exam
# Force garbage collection
gc.collect()
# Check the strong reference to the deleted instance
print(tuple(Within._seen.items()))
Here, we’re defining an Exam
class that uses the Within
descriptor to apply constraints
on the values of the math
, physics
, and chemistry
class variables. Then we initialize
the class instance and mutate the math
attribute to demonstrate that the validator is
working as expected. The instance of the Exam
class is saved to the _seen
dictionary of
the descriptor when the __set__
method is called. Next, we delete the Exam
instance and
force garbage collection. However, when you run the snippet, you’ll see that it prints the
following:
((<__main__.Exam object at 0x10466ca10>, 60),)
This indicates that although we’ve deleted the Exam
instance, it can’t be fully garbage
collected since the Within
descriptor’s _seen
dictionary holds a strong reference to it.
Dispel the malady
Once I spotted the bug, the solution was fairly simple. Don’t keep strong references to the
class instances if you don’t need to. Also, use a more robust tool like Pydantic2 to
perform validation but I digress here! Using a weakref.WeakKeyDictionary
instead of a
regular dict for _seen
would prevent the memory leakage by avoiding strong references to
the deleted instances. Since WeakKeyDictionary
holds weak references to the keys, if all
other strong references to an instance are deleted, the garbage collector can reclaim it.
The weak reference in WeakKeyDictionary
won’t keep the instance alive. Here’s how you’d
modify Within
to fix the issue:
from weakref import WeakKeyDictionary
class Within:
_seen = WeakKeyDictionary() # Drop in dict replacement
def __init__(self, min, max): ...
def __set_name__(self, instance, name): ...
def __get__(self, instance, instance_type): ...
def __set__(self, instance, value): ...
The modified descriptor is a drop-in replacement for the previous one—minus the memory
leakage issue. So in the last snippet, when exam
is deleted and the gc is called, weakref
allows the instance to be garbage collected correctly instead of remaining in memory due to
the strong reference in _seen
. The weak reference doesn’t interfere with gc freeing up the
memory as desired. If you run the demonstration snippet again, this time you’ll see that
once we force the gc to collect the garbage, the _seen
container gets emptied out.
exam = Exam(30, 50, 40)
exam.math = 60
# Delete the exam instance
del exam
# Force garbage collection
gc.collect()
# Check the strong reference to the deleted instance
print(tuple(Within._seen.items()))
This will print an empty tuple:
()
This also means that now Within
will only keep track of instances that are alive in
memory.
Recent posts
- SSH saga
- Injecting Pytest fixtures without cluttering test signatures
- Explicit method overriding with @typing.override
- Quicker startup with module-level __getattr__
- Docker mount revisited
- Topological sort
- Writing a circuit breaker in Go
- Discovering direnv
- Notes on building event-driven systems
- Bash namerefs for dynamic variable referencing