📊 Data Classes in Python

Data classes represent a relationship between different concepts.


Introduction

Data classes are user-defined types that let you group related data together. Many types, such as integers, strings, and enumerations, are scalar: they represent one and only one value. Other types, such as lists, sets, and dictionaries, represent homogeneous collections. However, you still need to be able to compose multiple fields of data into a single data type. Dictionaries and tuples are OK at this, but they suffer from a few issues. Readability is tricky, as it can be difficult to know what a dictionary or tuple contains at runtime. When your data is hard to understand, readers will make incorrect assumptions and won't be able to spot bugs as easily. Data classes are easier to read and understand, and type checkers handle them naturally. Data classes were added to the standard library in Python 3.7.

Homogeneous collections are collections in which every value has the same type. In contrast, values in heterogeneous collections may have different types within them.
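As a small illustration of the distinction, the sketch below contrasts a homogeneous list with a heterogeneous tuple. The variable names and values are made up for this example.

```python
# Homogeneous: every element has the same type (here, int).
temperatures: list[int] = [68, 71, 75]

# Heterogeneous: the elements have different types (str, int, float).
recipe_entry: tuple[str, int, float] = ("flour", 2, 1.5)

element_types = [type(value).__name__ for value in recipe_entry]
print(element_types)  # ['str', 'int', 'float']
```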

Data Classes in Action

Data classes represent a heterogeneous collection of variables, all rolled into a composite type [1]. A Fraction is an excellent example of a composite type. It contains two scalar values: a numerator and a denominator.

from fractions import Fraction
Fraction(numerator=1, denominator=2)

To represent a fraction with a dataclass, you do the following:

from dataclasses import dataclass
@dataclass
class MyFraction:
    numerator: int
    denominator: int
MyFraction(1, 2)

By building relationships like this, you are adding to the shared vocabulary in your codebase. Instead of developers always needing to implement each field individually, you provide a reusable grouping. Data classes force you to explicitly assign types to your fields, so there's less chance of type confusion among maintainers.

A dataclass is not limited to fields: you can also add behavior in the form of methods.
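A minimal sketch of a dataclass with methods might look like the following. The method names to_float and inverse are invented for this example and are not part of the dataclasses module.

```python
from dataclasses import dataclass


@dataclass
class MyFraction:
    numerator: int
    denominator: int

    def to_float(self) -> float:
        """Return the fraction as a floating-point number (hypothetical helper)."""
        return self.numerator / self.denominator

    def inverse(self) -> "MyFraction":
        """Return a new fraction with numerator and denominator swapped."""
        return MyFraction(self.denominator, self.numerator)


half = MyFraction(1, 2)
print(half.to_float())  # 0.5
print(half.inverse())   # MyFraction(numerator=2, denominator=1)
```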

Usage

Data classes have some built-in functions that make them really easy to work with.

String Conversion

There are two special methods, __str__ and __repr__, used to convert your object to its informal and official string representations. They are called when you invoke str() or repr() on an object. Data classes generate __repr__ for you, and because str() falls back to __repr__ when no __str__ is defined, both return the same output by default.
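For instance, using the MyFraction dataclass from earlier, both conversions produce the same text when the class is defined at module level:

```python
from dataclasses import dataclass


@dataclass
class MyFraction:
    numerator: int
    denominator: int


half = MyFraction(1, 2)
print(repr(half))  # MyFraction(numerator=1, denominator=2)
print(str(half))   # MyFraction(numerator=1, denominator=2)
```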

Equality

You can test equality (==, !=) between two data class instances by default. You can also make this explicit by passing eq=True when defining your dataclass:

from copy import deepcopy
from dataclasses import dataclass
@dataclass(eq=True)
class MyFraction:
    numerator: int
    denominator: int
num1 = MyFraction(1, 2)
num2 = MyFraction(2, 3)
num1 == num2           # False
num1 == deepcopy(num1) # True

By default, equality checks will compare every field across two instances of a dataclass. You can write your own __eq__ function to override the default functionality for equality checks.
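As one possible override, the sketch below compares fractions by value rather than field by field, so that 1/2 and 2/4 are treated as equal. It assumes denominators are never zero. Cross-multiplication is just one way to write such a check.

```python
from dataclasses import dataclass


@dataclass
class MyFraction:
    numerator: int
    denominator: int

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, MyFraction):
            return NotImplemented
        # Cross-multiply so equivalent fractions compare as equal.
        # Assumes non-zero denominators.
        return (self.numerator * other.denominator
                == other.numerator * self.denominator)


print(MyFraction(1, 2) == MyFraction(2, 4))  # True
print(MyFraction(1, 2) == MyFraction(1, 3))  # False
```

Because the class already defines __eq__, the decorator leaves it in place instead of generating its own.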

Relational Comparison

By default, data classes do not support relational comparison (<, >, <=, >=), so you cannot sort their instances. If you want relational comparison, you need to set order=True in the dataclass definition. The generated comparison methods will go through each field, comparing them in the order in which they were defined.
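To illustrate field-by-field ordering, here is a sketch using a hypothetical Version type. It is used instead of MyFraction because comparing a fraction's numerator first would not give a mathematically meaningful order.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Version:
    # Fields are compared in definition order: major first, then minor.
    major: int
    minor: int


versions = [Version(2, 0), Version(1, 4), Version(1, 2)]
print(sorted(versions))
# [Version(major=1, minor=2), Version(major=1, minor=4), Version(major=2, minor=0)]
print(Version(1, 9) < Version(2, 0))  # True
```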

Immutability

Sometimes, you need to convey that a dataclass should not be changed after it is created. In that case, you can specify that the dataclass must be frozen. To freeze a dataclass, add frozen=True to the dataclass decorator.
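A short sketch of what freezing looks like in practice: assigning to a field of a frozen instance raises dataclasses.FrozenInstanceError. The assignment_blocked flag exists only to make the outcome visible in this example.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class MyFraction:
    numerator: int
    denominator: int


half = MyFraction(1, 2)
try:
    half.numerator = 3  # Not allowed on a frozen dataclass
except FrozenInstanceError:
    assignment_blocked = True
else:
    assignment_blocked = False

print(assignment_blocked)  # True
```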

If you want to use your dataclass in a set or as a key in a dictionary, it must be hashable. This means it must define a __hash__ method that takes your object and distills it down to a number. When you freeze a dataclass, it automatically becomes hashable by its field values, as long as you don't explicitly disable equality checking and all fields are hashable.
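As a sketch, a frozen MyFraction can be stored in a set or used as a dictionary key, and instances with equal fields hash the same. The dictionary contents here are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MyFraction:
    numerator: int
    denominator: int


# Equal instances collapse to a single set member.
unique = {MyFraction(1, 2), MyFraction(1, 2), MyFraction(2, 3)}
print(len(unique))  # 2

# A frozen instance works as a dictionary key.
names = {MyFraction(1, 2): "one half"}
print(names[MyFraction(1, 2)])  # one half
```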

A frozen data class only prevents its members from being set. If the members are mutable, you are still able to call methods on those members to modify their values. Frozen data classes do not extend immutability to their attributes.
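The following sketch shows this limitation with a hypothetical Recipe type: the ingredients field cannot be reassigned, but the list it refers to can still be modified in place.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Recipe:
    # Hypothetical type used only for this illustration.
    name: str
    ingredients: list[str] = field(default_factory=list)


soup = Recipe("soup", ["water"])
soup.ingredients.append("salt")  # Allowed: mutates the list, not the field
print(soup.ingredients)  # ['water', 'salt']
```

If you need the contents to be immutable as well, one option is to use an immutable member type such as a tuple.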

Comparison to Other Types

Data Classes Versus Dictionaries

Dictionaries are fantastic for mapping keys to values, but they are most appropriate when they are homogeneous. When used for heterogeneous data, dictionaries are tougher for humans to reason about. Also, type checkers don't know enough about the dictionary to check for errors.

Data classes, however, are a natural fit for fundamentally heterogeneous data. Readers of the code know the exact fields present in the type and type checkers can check for correct usage. If you have heterogeneous data, use a data class before you reach for a dictionary.

Data Classes Versus TypedDict

TypedDict, introduced in Python 3.8, is another way to store heterogeneous data that makes sense for readers and type checkers. At first glance, TypedDict and data classes solve a very similar problem, so it can be tough to decide which one is appropriate. In most cases it would be better to choose a dataclass, since it provides immutability, comparability, equality and other operations. However, if you are already working with dictionaries, you should reach for a TypedDict.
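For comparison, here is a sketch of the fraction expressed as a TypedDict. The names FractionDict and to_float are invented for this example. At runtime the value is an ordinary dictionary, and the key names and types are only checked by a static type checker.

```python
from typing import TypedDict


class FractionDict(TypedDict):
    numerator: int
    denominator: int


def to_float(fraction: FractionDict) -> float:
    """Convert a fraction dictionary to a float (hypothetical helper)."""
    return fraction["numerator"] / fraction["denominator"]


half: FractionDict = {"numerator": 1, "denominator": 2}
print(to_float(half))          # 0.5
print(isinstance(half, dict))  # True
```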

Data Classes Versus Named Tuple

namedtuple is a tuple-like collection type in the collections module. Unlike plain tuples, it lets you name the fields like so:

from collections import namedtuple
MyFraction = namedtuple("MyFraction", ["numerator", "denominator"])
num = MyFraction(1, 2)

A namedtuple goes a long way toward making a tuple more readable, but a dataclass provides additional benefits, such as:

  • Explicitly type annotating your arguments
  • Control of immutability, comparability and equality
  • Easier to define functions in the type

Summary

Data classes were a game changer when released, because they allowed developers to define heterogeneous types that were fully typed while still staying lightweight. However, as great as data classes are, they should not be used universally. A data class, at its heart, represents a conceptual relationship, but it really is only appropriate when the members within the data class are independent of one another. If any of the members should be restricted depending on the other members, a data class will make it harder to reason about your code. Any developer could change the fields during your data class's lifetime, potentially creating an illegal state. In these cases, regular classes would be a better choice.
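As a rough sketch of the alternative, the regular class below protects a relationship between its members. TemperatureRange and its low <= high rule are hypothetical and stand in for any type whose fields constrain one another.

```python
class TemperatureRange:
    """Hypothetical type whose members must satisfy low <= high."""

    def __init__(self, low: float, high: float) -> None:
        if low > high:
            raise ValueError("low must not exceed high")
        self._low = low
        self._high = high

    @property
    def low(self) -> float:
        # Read-only: no setter is defined, so the rule cannot be broken later.
        return self._low

    @property
    def high(self) -> float:
        return self._high


valid = TemperatureRange(10.0, 20.0)
try:
    TemperatureRange(30.0, 20.0)  # Violates low <= high
except ValueError:
    rejected = True
else:
    rejected = False

print(rejected)  # True
```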

Footnotes

  1. Composite types are made up of multiple values, and should always represent some sort of relationship or logical grouping.