Wednesday, July 23, 2025

Pydantic or Dataclass or Namedtuple or Just a Class with Attributes

 

Introduction

While working for a client, I was given code with the following structure.

There was a SomeResults class

>>> class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>>

There was some function that had to return SomeResults

>>> def some_fcn() -> SomeResults:
...     raise NotImplementedError()
...
>>>

My task was to create the implementation. There were clear instructions on what the implementation was to output, and guidance on how to get started was provided through personal interactions.

However, I was puzzled as to how to structure the code. Do I access object attributes directly? Do I use a namedtuple? Do I use a dataclass? Do I use Pydantic? We will explore each of these below with their associated pros and cons.

The applicable requirements and restrictions imposed by the client are not going to be stated upfront. This way, the reader is forced to work through the thought process and approaches presented here. Anyway, sometimes everyone needs a little mystery. Also, in real life, it is rare to have all the requirements and restrictions stated upfront.

"Optional[list[str]]" Should Replace "list[str] | None"

The Python documentation states, "if an explicit value of None is allowed, the use of Optional is appropriate, whether the argument is optional or not". Unfortunately, one of the constraints imposed by the client was that the original declaration of SomeResults could not be modified. Consequently, we have to use "list[str] | None".

If you are interested in the details, refer to the typing module documentation and PEP 604.
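
For a quick, hypothetical illustration (assuming Python 3.10 or later, and with function names made up for this sketch), the two spellings can be used interchangeably in a function signature; a type checker treats them the same way.

>>> from typing import Optional

>>> def count_topics_a(topics: Optional[list[str]]) -> int:
...     return 0 if topics is None else len(topics)
...
>>> def count_topics_b(topics: list[str] | None) -> int:
...     return 0 if topics is None else len(topics)
...
>>> print(count_topics_a(None), count_topics_b(["topic_1x"]))
0 1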

Approach 1: Access Object's Attributes Directly

The final code would look something like the following

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     result = SomeResults()
...     result.topic_1 = topic_a
...     result.topic_2 = topic_b
...     result.topic_3 = topic_c
...     return result
...

>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']

>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']

>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']

The computation of the topics is complex, so it would be done separately. This is mimicked by

topic_a = ["topic_1x", "topic_1y", "topic_1z"]

topic_b = ["topic_2x", "topic_2y", "topic_2z"]

topic_c = ["topic_3x", "topic_3y", "topic_3z"]

Once the topics are computed, they are gathered together to create a result.

result = SomeResults()

result.topic_1 = topic_a
result.topic_2 = topic_b
result.topic_3 = topic_c

The advantage of the above approach is that it is quick: there is no manual implementation of methods like __init__().

The disadvantage of the above approach is that it sets individual attributes directly using dot notation. This is not considered Pythonic. An alternative approach is to use getter and setter methods. Getters and setters are a legacy from C++ that make library packaging practical and avoid recompiling the entire world. They should be avoided in Python. Another approach is to use properties, which is considered Pythonic. These different approaches generate much controversy. Consequently, the following links will allow you to gather information that can then be used to make a decision that is appropriate to your use case and time constraints.
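
To sketch the property-based alternative, here is a minimal, hypothetical variant (the class name and the type check are illustrative only, not part of the client's code):

>>> class SomeResultsWithProperties:
...     def __init__(self) -> None:
...         self._topic_1: list[str] | None = None
...     @property
...     def topic_1(self) -> list[str] | None:
...         return self._topic_1
...     @topic_1.setter
...     def topic_1(self, value: list[str] | None) -> None:
...         # a property makes this kind of check possible
...         if value is not None and not isinstance(value, list):
...             raise TypeError("topic_1 must be a list of strings or None")
...         self._topic_1 = value
...

>>> result = SomeResultsWithProperties()
>>> result.topic_1 = ["topic_1x"]
>>> print(result.topic_1)
['topic_1x']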

Approach 2: Use Namedtuple

The final code would look something like the following

>>> from collections import namedtuple

>>> SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'])

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...

>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']

>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']

>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']

>>>

The advantage of namedtuples is that they are immutable. This immutability is helpful because you want to combine the results of the topics once, at the end; you don't want to combine them over and over.
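
As a quick illustration of that immutability, a namedtuple field cannot be rebound after construction; the attempt raises an AttributeError (the exact error message varies across Python versions).

>>> from collections import namedtuple

>>> SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'])
>>> result = SomeResults(topic_1=["topic_1x"], topic_2=None, topic_3=None)
>>> try:
...     result.topic_1 = ["overwritten"]
... except AttributeError:
...     print("cannot rebind a namedtuple field")
...
cannot rebind a namedtuple field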

The disadvantage of namedtuples is that one has to provide a value for each topic or explicitly default each value to None (SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'], defaults=(None, None, None))). This seems trivial. Unfortunately, this particular client had so many topics that it would be annoying to initially set all the topics to None.
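
For completeness, here is a minimal sketch of the defaults variant mentioned above; with every topic defaulted to None, a partially specified result can be built.

>>> from collections import namedtuple

>>> SomeResults = namedtuple(
...     'SomeResults',
...     ['topic_1', 'topic_2', 'topic_3'],
...     defaults=(None, None, None),
... )

>>> print(SomeResults(topic_1=["topic_1x"]))
SomeResults(topic_1=['topic_1x'], topic_2=None, topic_3=None)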

If you want to brush up on namedtuples, consider reading the article "Write Pythonic and Clean Code With namedtuple" by Leodanis Pozo Ramos.

Approach 3: Use Dataclass

The final code would look something like the following

>>> from dataclasses import dataclass

>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...

>>> print(some_fcn())
SomeResults(topic_1=['topic_1x', 'topic_1y', 'topic_1z'], topic_2=['topic_2x', 'topic_2y', 'topic_2z'], topic_3=['topic_3x', 'topic_3y', 'topic_3z'])

The advantage of using a dataclass is that it is a "natural" fit because SomeResults is a class primarily used for storing data. Also, it automatically generates boilerplate methods such as __init__() and __repr__().

The disadvantage of dataclasses is that there is no runtime data validation.
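
As a minimal illustration of the missing validation, the dataclass below happily accepts values that contradict its own annotations.

>>> from dataclasses import dataclass

>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...

>>> print(SomeResults(topic_1=123, topic_2="not a list", topic_3=None))
SomeResults(topic_1=123, topic_2='not a list', topic_3=None)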

Also, the use of a decorator might seem to violate the constraint that the original declaration of SomeResults could not be modified. The strict interpretation is that we have modified the original declaration through the use of a decorator. However, it is a local modification of the implementation for a specific purpose. As a side note, if you are ever in a situation where you can't modify the code but at the same time you have to modify the code, think decorators.
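
As a sketch of that idea, a decorator is just a callable, so it can also be applied after the fact, outside the class body (whether this counts as "not modifying" the declaration is the judgment call discussed above):

>>> from dataclasses import dataclass

>>> class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...

>>> SomeResults = dataclass(SomeResults)

>>> print(SomeResults(topic_1=None, topic_2=None, topic_3=None))
SomeResults(topic_1=None, topic_2=None, topic_3=None)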

If you want to brush up on dataclasses, consider reading the article "Data Classes in Python 3.7+ (Guide)" by Geir Arne Hjelle.

Approach 4: Use Pydantic

The final code would look something like the following

>>> from pydantic import BaseModel

>>> class SomeResults(BaseModel):
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...

>>> print(some_fcn())
topic_1=['topic_1x', 'topic_1y', 'topic_1z'] topic_2=['topic_2x', 'topic_2y', 'topic_2z'] topic_3=['topic_3x', 'topic_3y', 'topic_3z']

This particular client was processing web pages from the internet, and so automatic runtime data validation was needed. This makes Pydantic a natural fit.
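
As a minimal sketch of that validation (assuming Pydantic v2; v1 behaves similarly), invalid data is rejected when the model is constructed.

>>> from pydantic import BaseModel, ValidationError

>>> class SomeResults(BaseModel):
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...

>>> try:
...     SomeResults(topic_1=123, topic_2=None, topic_3=None)
... except ValidationError:
...     print("topic_1 failed validation")
...
topic_1 failed validation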

A con is that Pydantic introduces an external dependency. The alternative is to write, debug, and maintain the equivalent code for this particular use case yourself, and it is not clear how realistic that would be. On the other hand, by introducing Pydantic to your tech stack, a lot of useful functionality becomes available, like JSON conversions.
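
For example, continuing the session above and assuming Pydantic v2 (v1 exposes .json() instead of model_dump_json()):

>>> result = SomeResults(topic_1=["topic_1x"], topic_2=None, topic_3=None)
>>> print(result.model_dump_json())
{"topic_1":["topic_1x"],"topic_2":null,"topic_3":null}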

Another con is that there is a higher overhead that arises from the validation.

Also, notice that the original class SomeResults is modified to be a subclass of BaseModel. For this particular client, this is not just a con but a deal breaker. The original class SomeResults cannot be modified.

If you want to brush up on Pydantic, consider reading the article "Pydantic: Simplifying Data Validation in Python" by Harrison Hoffman.

Approach 5: Use Pydantic dataclass Decorator

The final code would look something like the following

>>> from pydantic.dataclasses import dataclass

>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...

>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']

>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']

>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']

>>>

The Pydantic dataclass decorator satisfies all the requirements of the client. It supports runtime data validation. Also, no changes have been made to the original definition of the class SomeResults.
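
As a quick check (again assuming Pydantic v2), the decorated class rejects invalid data just like the BaseModel version.

>>> from pydantic import ValidationError
>>> from pydantic.dataclasses import dataclass

>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...

>>> try:
...     SomeResults(topic_1=123, topic_2=None, topic_3=None)
... except ValidationError:
...     print("topic_1 failed validation")
...
topic_1 failed validation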

Summary

As shown above, there are many ways to ensure that some function returns a specific type of output. Admittedly, the five approaches are not a thorough listing of all the possible approaches. However, they are illustrative, and there are length constraints imposed by people casually reading blog posts.

We started with "Approach 1," which is the simplest. We then used namedtuples. Unfortunately, this particular client could not comfortably use them because defaulting every one of their many topics to None would have been unwieldy. This forced us to move on to dataclasses. However, this particular client needed runtime data validation, and so Pydantic was needed. We still did not meet the client's requirements because we modified the original class SomeResults. We then used Pydantic's dataclass decorator so that we did not have to modify the class SomeResults.

19 comments:

  1. Excellent write-up as always! I love Pydantic but never messed with the decorators so that is a neat thing to be aware of. My initial impression here is that yet another way that this could be potentially done would be to have a `Dict[str, List[str | None]]` attribute keyed on topics, however that really only makes sense if topics are dynamic data. I guess it all depends on the data model.

    Replies
    1. >Excellent write-up as always!

      Thanks

      >I love Pydantic but never messed with the decorators so that is a neat thing to be aware of

      I too was not aware of it until recently. Luckily, I had a good friend who made me aware of it via a private email. Also, several other people commented that they were not aware of Pydantic's dataclass.

      >... here is ... yet another way ... Dict[str, List[str | None]]

      Thanks for proposing an alternative data structure

      >I guess it all depends on the data model

      Yeah - Creating the data model / data structure for a particular use case is tough

      >... it all depends

      :-)

  2. Great article, Robert! I have not had much opportunity to examine this library, but it seems very powerful. Agree that the obsession of doing everything inside Python native does not make sense.

    I have had mixed success with 'reflection' solutions since they depend on internal knowledge and immutable internal structure. Will take a look at this library for sure.

    Replies
    1. >Great article, Robert!

      Thanks

      >Agree that the obsession of doing everything inside Python native does not make sense

      Yeah - I think that is one of the reasons that Python became so popular - There is a basic simple core Python language - Then you have an entire ecosystem from which to select a solution that works for your particular use case

  3. Nice description. Good examples and thanks for sharing. Personally, I don't find option 1 horrifying, so I'm not convinced that it's an antipattern. And with other libraries like [typeguard](https://pypi.org/project/typeguard/), you might get runtime validation. And if you are able to re-define SomeResults, you can easily set `None` as the unset default.
    I'm also reluctant to accept the use of Pydantic. It's more than just a dependency. It's a _non-pure_ dependency, meaning it will only work in environments where extension modules can be compiled and installed. Granted, that's most environments, but it excludes use-cases such as vendored libraries, cygwin, web-based interpreters, and embedded environments. For example, I tried to use pydantic in inflect, but found that [non viable](https://github.com/jaraco/inflect/issues/195). Pydantic is wonderful and I love what it does, but be considerate about the constraints it creates when adopting it.

    Replies
    1. >Nice description. Good examples

      Thanks

      >Personally, I don't find option 1 horrifying, so I'm not convinced that it's an antipattern

      Yeah - I made the mistake of using too strong a wording. I have added an entire paragraph at the end of approach 1 which starts with "The disadvantage of the above approach is that it sets individual attributes directly using dot notation. This is not considered Pythonic." - There are links afterwards which will allow people to explore the topic for themselves and make their own choice.

      >typeguard

      typeguard: 1.7k stars on github

      Pydantic: 24.7k stars on github

      >... tried to use pydantic in inflect, but found that [non viable](https://github.com/jaraco/inflect/issues/195)

      >Pydantic is wonderful and I love what it does

      >but be considerate about the constraints it creates when adopting it

      Fair enough

  4. I personally like native classes and try to think twice before moving from them. However, nowadays it's unlikely that you are working with native Python code only. You probably already using Pydantic (or an alternative) in the project, so using dataclass from Pydantic (or even BaseModel, if you have a bit more complex use-case) is a good practical choice.

    Replies
    1. Python has unfortunately drank the Dependency Hell kool-aid and now has to lie in this bed. Even if you brush aside the security implications, it is aggravating that even simple projects now require multiple gigabytes of dependencies to be acquired, which sucks especially bad if you are working in a highly modular application. Won't someone think of the SSD wear leveling?! :)

    2. >Python has unfortunately drank the Dependency Hell kool-aid

      >Even if you brush aside the security implications

      What is the alternative?

      Write your own?

      You can still elect to do that if you wish

      >require multiple gigabytes of dependencies

      Interesting

      When you get a chance please provide some examples

      I can't remember this happening to me even once

      >SSD wear leveling

      At least the SSD vendors will be happy :-)

  5. Thanks for sharing the multiple pythonic approaches to achieve one thing. My preference, given Python is a glue tool, is actually approach 1 as a starting point -- it is native Python, clean and easy to read and maintain. If one needs to make it "pattern", there are tricks such as gets and sets; explicit type validation is not too ugly to add; if such needs repeat in many places, then consider using decorators for example.

    (if one needs fast running time, and the implementation here is profiled to be the bottleneck, maybe consider using C and Python bindings, or another language directly)

    Replies
    1. >My preference ... is actually approach 1 as a starting point

      Notice the use of the phrase "as a starting point"

      I have added an entire paragraph at the end of approach 1 which starts with "The disadvantage of the above approach is that it sets individual attributes directly using dot notation. This is not considered Pythonic." - There are links afterwards which will allow people to explore the topic for themselves and make their own choice.

      >gets and sets

      The last paragraph of approach 1 states: An alternative approach is to use getter and setter methods. Getters and setters are a legacy from C++ that make library packaging practical and avoid recompiling the entire world. They should be avoided in Python.

      > if one needs fast running time and the implementation here is profiled to be the bottleneck maybe consider using ...

      Thanks for the suggestions

      Unfortunately, I cannot include ways to address the various cons. It would make the blog post too long. People expect blog posts to be a quick read, and if they are not, they will move on to others that are.

  6. Can you elaborate on how technique 5 does not make any changes to the original definition of SomeResults? It seems to me that you are re-defining it (though I _think_ it has the same definition? Is that acceptable?)

    Another question is, is there a significant semantic difference between a None value and an empty list? I'm getting the impression this is attempting to get closer to "make invalid states unrepresentable", but I'm not sure. Either way, it appears that the original definition has that, so.

    Replies
    1. Oh, rereading, I see that you address this (though I didn't realize it at first): "As a side note, if you are ever in a situation where you can't modify the code but at the same time you have to modify the code, think decorators."

      It is curious to me that decorating something isn't the same as modifying it.

      Are there some API properties between components here that must be maintained, and "can't modify SomeResults" is just a shorthand for saying so? Is there some reason why duck type polymorphism wouldn't work here?

    2. >is there a significant semantic difference between a None value and an empty list?

      Humm

      There are multiple ways to look at this

      Approach 1: Data Types


      >>> type(None)
      <class 'NoneType'>

      >>> type([])
      <class 'list'>


      Approach 2: Truthiness


      >>> if None: print("Hello")
      ...

      >>> if not None: print("Hello")
      ...
      Hello

      >>> if []: print("Hello")
      ...

      >>> if not []: print("Hello")
      ...
      Hello

      Approach 3: Equivalent to asking why use NULL in a database

      In databases, NULL is a special marker used to indicate that a data value is unknown, missing, or not applicable for a particular column in a row

      It is distinct from an empty string, a zero, or any other defined value

      Approach 4: Go with what the client wants

      In this particular use case, the client wanted an explicit None, and so we went with that

      There was no time to engage the client in a conversation about this

    3. >It is curious to me that decorating something isn't the same as modifying it

      I received many comments about this

      Strictly speaking, decorating something is the same as modifying it

      I tried to explain the non-strict perspective by adding the following paragraph to approach 3: Also, the use of a decorator might seem to violate the constraint that the original declaration of SomeResults could not be modified. The strict interpretation is ...

    4. >Are there some API properties between components here that must be maintained, and "can't modify SomeResults" is just a shorthand for saying so?

      Yes

    5. >Is there some reason why duck type polymorphism wouldn't work here?

      Humm

      I read the words, but don't understand

      Could you provide a working code snippet demonstrating the idea?

  7. The original code had both "list[str] | None" and "Optional[List[str]]". I received comments about this.

    The blog post has been updated by adding the section titled: "Optional[list[str]]" Should Replace "list[str] | None". Also, all the code snippets have been updated to use "list[str] | None".

    Starting with Python 3.10, unions can be written as X | Y, as introduced by PEP 604, titled "Allow writing union types as X | Y". On top of that, the current documentation states "To define a union, use e.g. Union[int, str] or the shorthand int | str. Using that shorthand is recommended.".

    However, the twist for this particular use case is that an explicit value of None is allowed. So, we should follow the official documentation, which states "if an explicit value of None is allowed, the use of Optional is appropriate, whether the argument is optional or not".

    The double twist for this particular use case is that the constraint imposed by the client was that the original declaration of SomeResults could not be modified. So, back to using "list[str] | None".

  8. This is a great primer! Extremely helpful for anyone choosing between data containers in Python.

    I appreciate how you balance practical implementation with conceptual explanation while walking us through five different approaches.
