Python Hashes and Equality

Most Python programmers don’t spend a lot of time thinking about how equality and hashing works. It usually just works. However there’s quite a bit of gotchas and edge cases that can lead to subtle and frustrating bugs once one starts to customize their behavior – especially if the rules on how they interact aren’t understood.

Object equality

Equality in Python is more complicated than most people realize but at its core you have to implement a __eq__(self, other) method. It should return either a boolean value if your class knows how to compare itself to other or NotImplemented if it doesn’t. For inequality checks using !=, the corresponding method is __ne__(self, other).

By default, those methods are inherited from the object class that compares two instances by their identity – therefore instances are only equal to themselves.

A common mistake in Python 2 was to override only __eq__() and forget about __ne__(). Python 3 is friendly enough to implement an obvious __ne__() for you, if you don’t yourself.

Object hashes

An object hash is an integer number representing the value of the object and can be obtained using the hash() function if the object is hashable. To make a class hashable, it has to implement both the __hash__(self) method and the aforementioned __eq__(self, other) method. As with equality, the inherited object.__hash__ method works by identity only: barring the unlikely event of a hash collision, two instances of the same class will always have different hashes, no matter what data they carry.

Since this is usually good enough, most Pythonistas don’t realize there’s even a thing called hashing until they try to add an unhashable object into a set or a dictionary:

>>> set().add({})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
unhashable type: 'dict'

Therefore, hashes are important because sets and dictionaries use them for their lookup tables to quickly find their keys. To do that effectively, they make an important assumption that leads to our first gotcha:

The hash of an object must never change during its lifetime.

So if you decide to do the perfectly sensible thing and define the equality and hash of your object by the hash and equality of a tuple of the instance’s attributes, you have to make sure those attributes never change lest weird things happen:

>>> class C:
...     def __init__(self, x):
...         self.x = x
...     def __repr__(self):
...         return f"C({self.x})"
...     def __hash__(self):
...         return hash(self.x)
...     def __eq__(self, other):
...         return (
...             self.__class__ == other.__class__ and
...             self.x == other.x
...         )
>>> d = dict()
>>> s = set()
>>> c = C(1)
>>> d[c] = 42
>>> s.add(c)
>>> d, s
({C(1): 42}, {C(1)})
>>> c in s and c in d  # c is in both!
True
>>> c.x = 2
>>> c in s or c in d   # c is in neither!?
False
>>> d, s
({C(2): 42}, {C(2)})   # but...it's right there!

Although our mutated c clearly is in both d and s, Python claims it never heard of it!

This explains why all immutable data structures like tuples or strings are hashable while mutable ones like lists or dictionaries aren’t.

To make matters even more confusing, creating an object with the same hash value will also not work because Python is going to throw a call to __eq__ into the mix and C(1) is clearly not equal to C(2):

>>> C(1) in s or C(1) in d
False

Why the equality check? As we’ve established before, a hash is an integer. And even though we have 64 bits to splurge on modern architectures, there’s still the possibility that two objects have the same hash.

Given this behavior we’ve found another assumption made by sets and dictionaries:

Hashable objects which compare equal must have the same hash value.

In other words: if x == y it must follow that hash(x) == hash(y).

Since that’s not true in our case, we can’t access that object by its hash anymore.

What does all of this mean?

You can’t base your hash on mutable values. If an attribute can change in the lifetime of an object, you can’t use it for hashing or very funky things happen.

Generally speaking, immutable objects are the cleanest approach and they come with many other upsides your FP-loving friends will happily explain to you at length.

Practically speaking though, that’s not always possible – for performance reasons alone. Python just isn’t conceived with immutability in mind like, say Clojure.

Hashes can be less picky than equality checks. Since key lookups are always followed by an equality check, your hashes don’t have to be unique. That means that you can compute your hash over an immutable subset of attributes that may or may not be a unique “primary key” for the instance.

You can take this property to town by returning a constant hash and make the set or dictionary work purely on equality checks. However that would regress them into a list and lead to terrible performance.

You shouldn’t compare by value but hash by identity. This approach fails to take the second assumption into account. That said, it usually works perfectly fine because it takes a rather unlikely hash collision to become a problem.

However it’s still a violation of the contract with the Python runtime and may lead to problems; albeit only possibly in the future. Python feels so strongly about this that as of Python 3, it automatically makes classes unhashable if you implement __eq__ but not __hash__. Python 2 lets you happily shoot off your foot.

Object equality

Object hashes

What does all of this mean?

Hynek Schlawack