While working for a client, I was given code with the following structure.
There was a SomeResults class
>>> class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>>
There was some function that had to return SomeResults
>>> def some_fcn() -> SomeResults:
...     raise NotImplementedError()
...
>>>
My task was to create the implementation. There were clear instructions on what the implementation was to output, and guidance on how to get started was provided through personal interactions.
However, I was puzzled as to how to structure the code. Do I access object attributes directly? Do I use a namedtuple? Do I use a dataclass? Do I use Pydantic? We will explore each of these below with their associated pros and cons.
The applicable requirements and restrictions imposed by the client are not going to be stated upfront. This way, the reader is forced to work through the thought process behind each approach. Anyway, sometimes everyone needs a little mystery. Also, in real life, it is rare to have all the requirements and restrictions stated upfront.
"Optional[list[str]]" Should Replace "list[str] | None"

The Python documentation states, "if an explicit value of None is allowed, the use of Optional is appropriate, whether the argument is optional or not". Unfortunately, one of the constraints imposed by the client was that the original declaration of SomeResults could not be modified. Consequently, we have to use "list[str] | None".
If you are interested in the details, refer to
Approach 1: Access Object Attributes Directly

The final code would look something like the following:
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     result = SomeResults()
...     result.topic_1 = topic_a
...     result.topic_2 = topic_b
...     result.topic_3 = topic_c
...     return result
...
>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']
>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']
The computation of the topics is complex, so it would be done separately. This is mimicked by:
topic_a = ["topic_1x", "topic_1y", "topic_1z"]
topic_b = ["topic_2x", "topic_2y", "topic_2z"]
topic_c = ["topic_3x", "topic_3y", "topic_3z"]
Once the topics are computed, they are gathered together to create a result.
result = SomeResults()
result.topic_1 = topic_a
result.topic_2 = topic_b
result.topic_3 = topic_c
The advantage of the above approach is that it is quick: there is no manual implementation of methods like __init__().
The disadvantage of the above approach is that it sets individual attributes directly using dot notation. This is not considered Pythonic. An alternative approach is to use getter and setter methods; however, getters and setters are a legacy from C++, where they make library packaging practical and avoid recompiling the entire world, and they should be avoided in Python. Another approach is to use properties, which is considered Pythonic. These different approaches generate much controversy. Consequently, the following links will allow you to gather information and make a decision that is appropriate to your use case and time constraints.
Approach 2: namedtuple

The final code would look something like the following:
>>> from collections import namedtuple
>>> SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'])
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']
>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']
>>>
The advantage of namedtuples is that they are immutable. This immutability is helpful because you want to combine the results of the topics once, at the end, not over and over.
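The immutability is easy to verify (a small sketch reusing the same field names): reassigning a field raises AttributeError.

```python
from collections import namedtuple

SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'])
result = SomeResults(topic_1=["topic_1x"], topic_2=None, topic_3=None)

try:
    result.topic_1 = []   # namedtuple fields cannot be reassigned
except AttributeError:
    print("fields are read-only")
```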
The disadvantage of namedtuples is that one has to provide a value for each topic or explicitly default each value to None (SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'], defaults=(None, None, None))). This seems trivial. Unfortunately, this particular client had so many topics that initially setting all of them to None would be annoying.
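To make the annoyance concrete, the defaults can at least be generated rather than typed out by hand. A sketch with invented field names (the real client had far more topics):

```python
from collections import namedtuple

# Hypothetical: many topics. Generating the field names and a matching
# tuple of None defaults avoids writing each default manually.
fields = [f"topic_{i}" for i in range(1, 11)]
ManyResults = namedtuple('ManyResults', fields, defaults=(None,) * len(fields))

result = ManyResults(topic_1=["topic_1x"])
print(result.topic_1)    # ['topic_1x']
print(result.topic_10)   # None
```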
If you want to brush up on namedtuples, consider reading the article "Write Pythonic and Clean Code With namedtuple" by Leodanis Pozo Ramos.
Approach 3: dataclass

The final code would look something like the following:
>>> from dataclasses import dataclass
>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn())
SomeResults(topic_1=['topic_1x', 'topic_1y', 'topic_1z'], topic_2=['topic_2x', 'topic_2y', 'topic_2z'], topic_3=['topic_3x', 'topic_3y', 'topic_3z'])
The advantage of using a dataclass is that it is a "natural" fit because SomeResults is a class primarily used for storing data. Also, it automatically generates boilerplate methods.
The disadvantage of dataclasses is that there is no runtime data validation.
Also, the use of a decorator might seem to violate the constraint that the original declaration of SomeResults could not be modified. The strict interpretation is that we have modified the original declaration through the use of a decorator. However, it is a local modification of the implementation for a specific purpose. As a side note, if you are ever in a situation where you can't modify the code but at the same time you have to modify the code, think decorators.
If you want to brush up on dataclasses, consider using the article "Data Classes in Python 3.7+ (Guide)" by Geir Arne Hjelle.
Approach 4: Pydantic BaseModel

The final code would look something like the following:
>>> from pydantic import BaseModel
>>> class SomeResults(BaseModel):
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn())
topic_1=['topic_1x', 'topic_1y', 'topic_1z'] topic_2=['topic_2x', 'topic_2y', 'topic_2z'] topic_3=['topic_3x', 'topic_3y', 'topic_3z']
This particular client was processing web pages from the internet, and so automatic runtime data validation was needed. This makes Pydantic a natural fit.
A con is that Pydantic introduces an external dependency. The alternative is to write, debug, and maintain the equivalent code for this particular use case yourself, which is rarely realistic. On the other hand, by introducing Pydantic to your tech stack, a lot of useful functionality becomes available, like JSON conversion.
Another con is the extra runtime overhead introduced by the validation.
Also, notice that the original class SomeResults is modified to be a subclass of BaseModel. For this particular client, this is not just a con but a deal breaker. The original class SomeResults cannot be modified.
If you want to brush up on Pydantic, consider reading the article "Pydantic: Simplifying Data Validation in Python" by Harrison Hoffman.
Approach 5: Pydantic dataclass

The final code would look something like the following:
>>> from pydantic.dataclasses import dataclass
>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']
>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']
>>>
The Pydantic dataclass decorator satisfies all of the client's requirements. It supports runtime data validation, and no changes are made to the original definition of the class SomeResults.
As shown above, there are many ways to ensure that a function returns a specific type of output. The five approaches are by no means a thorough listing of all the possibilities, but they are illustrative, and there are length constraints imposed by people casually reading blog posts.
We started with Approach 1, which is the simplest. We then used namedtuples; unfortunately, this particular client had so many topics that defaulting each value to None would have been annoying. This forced us to move on to dataclasses. However, this particular client needed runtime data validation, and so Pydantic was needed. We still did not meet the client's requirements, because we had modified the original class SomeResults. We then used Pydantic's dataclass decorator so that we did not have to modify the class SomeResults.
Comments
Excellent write-up as always! I love Pydantic but never messed with the decorators so that is a neat thing to be aware of. My initial impression here is that yet another way that this could be potentially done would be to have a `Dict[str, List[str | None]]` attribute keyed on topics, however that really only makes sense if topics are dynamic data. I guess it all depends on the data model.
>Excellent write-up as always!
Thanks
>I love Pydantic but never messed with the decorators so that is a neat thing to be aware of
I too was not aware of it until recently. Luckily, I had a good friend who made me aware of it via a private email. Also, several other people commented that they were not aware of Pydantic's dataclass.
>... here is ... yet another way ... Dict[str, List[str | None]]
Thanks for proposing an alternative data structure
>I guess it all depends on the data model
Yeah - Creating the data model / data structure for a particular use case is tough
>... it all depends
:-)
Great article, Robert! I have not had much opportunity to examine this library, but it seems very powerful. Agree that the obsession of doing everything inside Python native does not make sense.
I have had mixed success with 'reflection' solutions since they depend on internal knowledge and immutable internal structure. Will take a look at this library for sure.
>Great article, Robert!
Thanks
>Agree that the obsession of doing everything inside Python native does not make sense
Yeah - I think that is one of the reasons that Python became so popular - There is a basic, simple core Python language - Then you have an entire ecosystem from which to select a solution that works for your particular use case
Nice description. Good examples and thanks for sharing. Personally, I don't find option 1 horrifying, so I'm not convinced that it's an antipattern. And with other libraries like [typeguard](https://pypi.org/project/typeguard/), you might get runtime validation. And if you are able to re-define SomeResults, you can easily set `None` as the unset default.
I'm also reluctant to accept the use of Pydantic. It's more than just a dependency. It's a _non-pure_ dependency, meaning it will only work in environments where extension modules can be compiled and installed. Granted, that's most environments, but it excludes use-cases such as vendored libraries, cygwin, web-based interpreters, and embedded environments. For example, I tried to use pydantic in inflect, but found that [non viable](https://github.com/jaraco/inflect/issues/195). Pydantic is wonderful and I love what it does, but be considerate about the constraints it creates when adopting it.
>Nice description. Good examples
Thanks
>Personally, I don't find option 1 horrifying, so I'm not convinced that it's an antipattern
Yeah - I made the mistake of using too strong a wording. I have added an entire paragraph at the end of approach 1 which starts with "The disadvantage of the above approach is that it sets individual attributes directly using dot notation. This is not considered Pythonic." - There are links afterwards which will allow people to explore the topic for themselves and make their own choice.
>typeguard
typeguard: 1.7k stars on GitHub
Pydantic: 24.7k stars on GitHub
>... tried to use pydantic in inflect, but found that [non viable](https://github.com/jaraco/inflect/issues/195)
>Pydantic is wonderful and I love what it does
>but be considerate about the constraints it creates when adopting it
Fair enough
I personally like native classes and try to think twice before moving away from them. However, nowadays it's unlikely that you are working with native Python code only. You are probably already using Pydantic (or an alternative) in the project, so using dataclass from Pydantic (or even BaseModel, if you have a somewhat more complex use case) is a good practical choice.
Python has unfortunately drunk the Dependency Hell Kool-Aid and now has to lie in this bed. Even if you brush aside the security implications, it is aggravating that even simple projects now require multiple gigabytes of dependencies to be acquired, which sucks especially badly if you are working on a highly modular application. Won't someone think of the SSD wear leveling?! :)
>Python has unfortunately drunk the Dependency Hell Kool-Aid
>Even if you brush aside the security implications
What is the alternative?
Write your own?
You can still elect to do that if you wish
>require multiple gigabytes of dependencies
Interesting
When you get a chance, please provide some examples
I can't remember even one time when this happened to me
>SSD wear leveling
At least the SSD vendors will be happy :-)
Thanks for sharing the multiple Pythonic approaches to achieve one thing. My preference, given Python is a glue tool, is actually approach 1 as a starting point -- it is native Python, clean, and easy to read and maintain. If one needs to make it a "pattern", there are tricks such as getters and setters; explicit type validation is not too ugly to add; and if such needs repeat in many places, then consider using decorators, for example.
(If one needs fast running time, and the implementation here is profiled to be the bottleneck, maybe consider using C with Python bindings, or another language directly.)
>My preference ... is actually approach 1 as a starting point
Notice the use of the phrase "as a starting point"
I have added an entire paragraph at the end of approach 1 which starts with "The disadvantage of the above approach is that it sets individual attributes directly using dot notation. This is not considered Pythonic." - There are links afterwards which will allow people to explore the topic for themselves and make their own choice.
>getters and setters
The last paragraph of approach 1 states: An alternative approach is to use getter and setter methods. Getters and setters are a legacy from C++ that make library packaging practical and avoid recompiling the entire world. They should be avoided in Python.
>if one needs fast running time and the implementation here is profiled to be the bottleneck, maybe consider using ...
Thanks for the suggestions
Unfortunately, I cannot include ways to address the various cons. It would make the blog post too long. People expect blog posts to be a quick read, and if they are not, they will move on to others that are.
Can you elaborate on how technique 5 does not make any changes to the original definition of SomeResults? It seems to me that you are re-defining it (though I _think_ it has the same definition? Is that acceptable?)
Another question: is there a significant semantic difference between a None value and an empty list? I'm getting the impression this is attempting to get closer to "make invalid states unrepresentable", but I'm not sure. Either way, it appears that the original definition has that, so.
Oh, rereading, I see that you address this (though I didn't realize it at first): "As a side note, if you are ever in a situation where you can't modify the code but at the same time you have to modify the code, think decorators."
It is curious to me that decorating something isn't the same as modifying it.
Are there some API properties between components here that must be maintained, and "can't modify SomeResults" is just a shorthand for saying so? Is there some reason why duck type polymorphism wouldn't work here?
>is there a significant semantic difference between a None value and an empty list?
Hmm
There are multiple ways to look at this
Approach 1: Data Types
>>> type(None)
<class 'NoneType'>
>>> type([])
<class 'list'>
Approach 2: Truthiness
>>> if None: print("Hello")
...
>>> if not None: print("Hello")
...
Hello
>>> if []: print("Hello")
...
>>> if not []: print("Hello")
...
Hello
Approach 3: Equivalent to asking why use NULL in a database
In databases, NULL is a special marker used to indicate that a data value is unknown, missing, or not applicable for a particular column in a row
It is distinct from an empty string, a zero, or any other defined value
Approach 4: Go with what the client wants
In this particular use case, the client wanted an explicit None, so we went with that
There was no time to engage the client in a conversation about this
>It is curious to me that decorating something isn't the same as modifying it
I received many comments about this
Strictly speaking, decorating something is the same as modifying it
I tried to explain the non-strict perspective by adding the following paragraph to approach 3: Also, the use of a decorator might seem to violate the constraint that the original declaration of SomeResults could not be modified. The strict interpretation is ...
>Are there some API properties between components here that must be maintained, and "can't modify SomeResults" is just a shorthand for saying so?
Yes
>Is there some reason why duck type polymorphism wouldn't work here?
Hmm
I read the words, but don't understand
Could you provide a working code snippet demonstrating the idea?
The original code had both "list[str] | None" and "Optional[List[str]]". I received comments about this.
The blog post has been updated by adding the section titled "Optional[list[str]]" Should Replace "list[str] | None". Also, all the code snippets have been updated to use "list[str] | None".
Starting with Python 3.10, unions can be written as X | Y, per PEP 604, titled "Allow writing union types as X | Y". On top of that, the current documentation states, "To define a union, use e.g. Union[int, str] or the shorthand int | str. Using that shorthand is recommended."
However, the twist for this particular use case is that an explicit value of None is allowed. So, we should follow the official documentation, which states, "if an explicit value of None is allowed, the use of Optional is appropriate, whether the argument is optional or not".
The double twist for this particular use case is that the constraint imposed by the client was that the original declaration of SomeResults could not be modified. So, back to using "list[str] | None".
This is a great primer! Extremely helpful for anyone choosing between data containers in Python.
I appreciate how you balance practical implementation with conceptual explanation while walking us through five different approaches.