📊 Data Classes in Python
Data classes represent a relationship between different concepts.
Introduction
Data classes are user-defined types that let you group related data together. Many types, such as integers, strings, and enumerations, are scalar, they represent one and only one value. Other types, such as lists, sets, and dictionaries, represent homogeneous collections. However, you still need to be able to compose multiple fields of data into a single data type. Dictionaries and tuples are OK at this, but they suffer from a few issues. Readability is tricky, as it can be difficult knowing what a dictionary or tuple contains at runtime. When your data is hard to understand, readers will make incorrect assumptions and won't be able to spot bugs as easily. Data classes are easier to read and understand, and the type checker knows how to naturally handle them. Python first supported data classes in Python 3.7.
Homogeneous collections are collections in which every value has the same type. In contrast, values in heterogeneous collections may have different types within them.
Data Classes in Action
Data classes represent a heterogeneous collection of variables, all rolled into a composite type1. For example, a Fraction
is an excellent example of a composite type. It contains two scalar values: a numerator
and a denominator
.
To represent a fraction with a dataclass
, you do the following:
By building relationships like this, you are adding to the shared vocabulary in your codebase. Instead of developers always needing to implement each field individually, you provide a reusable grouping. Data classes force you to explicitly assign types to your fields, so there's less chance of type confusion among maintainers.
You can not only add fields in a dataclass
, but you are also able to add in behaviors in the form of methods.
Usage
Data classes have some built-in functions that make them really easy to work with.
String Conversion
There are two special methods, __str__
and __repr__
, used to convert your object to its informal and official string representation. They are called when you invoke str()
or repr()
on an object. Data classes define these functions, and they will return the same output by default.
Equality
You can test equality(==
, !=
) between two data classes by default, you can still specify eq=True
when defining your dataclass
explicitly:
By default, equality checks will compare every field across two instances of a dataclass
. You can write your own __eq__
function to override the default functionality for equality checks.
Relational Comparison
By default, data classes do not support relational comparison(<
, >
, <=
, >=
), so you cannot sort the data classes. If you want to be able to define relational comparison, you need to set order=True
in the dataclass
definition. The generated comparison functions will go through each field, comparing them in the order in which they were defined.
Immutability
Sometimes, you need to convey that a dataclass
should not be able to be changed. In that case, you can specify that a dataclass
must be frozen
. To freeze a dataclass
, add a frozen=True
to the dataclass
decorator.
If you want to use your dataclass
in a set or as a key in a dictionary it must be hashable. This means it must define a __hash__
function that takes your object and distills it down to a number. When you freeze a dataclass
, it automatically becomes hashable, as long as you don't explicitly disable equality checking and all fields are hashable.
A frozen data class only prevents its members from being set. If the members are mutable, you are still able to call methods on those members to modify their values. Frozen data classes do not extend immutability to their attributes.
Comparison to Other Types
Data Classes Versus Dictionaries
Dictionaries are fantastic for mapping keys to values, but they are most appropriate when they are homogeneous. When used for heterogeneous data, dictionaries are tougher for humans to reason about. Also type checkers don't know enough about the dictionary to check for errors.
Data classes, however, are a natural fit for fundamentally heterogeneous data. Readers of the code know the exact fields present in the type and type checkers can check for correct usage. If you have heterogeneous data, use a data class before you reach for a dictionary.
Data Classes Versus TypedDict
TypedDict
, introduced in Python 3.8, is another way to store heterogeneous data that makes sense for readers and type checkers. At first glance, TypedDict
and data class solve a very similar problem, it can be tough to decide which one is appropriate. In most cases it would be better to choose a dataclass
, since it provides immutability, comparability, equality and other operations. However, if you are already working with dictionaries, you should reach for a TypedDict
.
Data Classes Versus Named Tuple
namedtuple
is a tuple-like collection type in the collections module. Unlike tuples, it allows for you to name the fields in a tuple like so:
A namedtuple
goes a long way toward making a tuple more readable, but dataclass
provides more benefits that it like:
- Explicitly type annotating your arguments
- Control of immutability, comparability and equality
- Easier to define functions in the type
Summary
Data classes were a game changer when released, because they allowed developers to define heterogeneous types that were fully typed while still staying lightweight. However, as great as data classes are they should not be universally used. A data class, at its heart, represents a conceptual relationship, but it really is only appropriate when the members within the data class are independent of one another. If any of the members should be restricted depending on the other members, a data class will make it harder to reason about your code. Any developer could change the fields during your data classes' lifetime, potentially creating an illegal state. In these cases, classes would be a better choice.
Footnotes
-
Composite types are made up of multiple values, and should always represent some sort of relationship or logical grouping. ↩