While working for a client, I was given code with the following structure.
There was a SomeResults class:
>>> class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>>
There was some function that had to return SomeResults:
>>> def some_fcn() -> SomeResults:
...     raise NotImplementedError()
...
>>>
My task was to create the implementation. There were clear instructions on what the implementation was to output, and in-person discussions provided guidance on how to get started.
However, I was puzzled as to how to structure the code. Do I access object attributes directly? Do I use a namedtuple? Do I use a dataclass? Do I use Pydantic? We will explore each of these below with their associated pros and cons.
The applicable requirements and restrictions imposed by the client are not going to be stated upfront. This forces the reader to work through the thought process and approaches presented here. Anyway, everyone needs a little mystery sometimes. Also, in real life, it is rare to have all the requirements and restrictions stated upfront.
The Python documentation states, "if an explicit value of None is allowed, the use of Optional is appropriate, whether the argument is optional or not". Unfortunately, one of the constraints imposed by the client was that the original declaration of SomeResults could not be modified. Consequently, we have to use "list[str] | None".
If you are interested in the details, refer to the documentation of the typing module.
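For what it is worth, the two spellings describe the same type; a quick check in the interpreter (Python 3.10+) confirms the equivalence.
>>> from typing import Optional
>>> Optional[list[str]] == list[str] | None
True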
The final code, setting the object's attributes directly, would look something like the following:
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     result = SomeResults()
...     result.topic_1 = topic_a
...     result.topic_2 = topic_b
...     result.topic_3 = topic_c
...     return result
...
>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']
>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']
The computation of the topics is complex, so it would be done separately. This is mimicked by:
topic_a = ["topic_1x", "topic_1y", "topic_1z"]
topic_b = ["topic_2x", "topic_2y", "topic_2z"]
topic_c = ["topic_3x", "topic_3y", "topic_3z"]
Once the topics are computed, they are gathered together to create a result.
result = SomeResults()
result.topic_1 = topic_a
result.topic_2 = topic_b
result.topic_3 = topic_c
The advantage of the above approach is that it is quick. There is no manual implementation of methods like __init__().
The disadvantage of the above approach is that it sets individual attributes directly using dot notation, which is not considered Pythonic. An alternative is to use getter and setter methods. Getters and setters are a legacy of languages like C++, where hiding data members behind accessors makes library packaging practical and avoids recompiling the entire world; in Python they should be avoided. Another approach is to use properties, which is considered Pythonic. These different approaches generate much controversy, so gather the information and make the decision that is appropriate to your use case and time constraints.
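For reference, a property-based version of a single topic might look something like the sketch below. The class name and the underscore-prefixed attribute are purely illustrative conventions, not part of the client's code.
>>> class SomeResultsWithProperties:
...     def __init__(self):
...         self._topic_1 = None
...     @property
...     def topic_1(self):
...         # reads go through the property
...         return self._topic_1
...     @topic_1.setter
...     def topic_1(self, value):
...         # a natural place to add validation or logging later
...         self._topic_1 = value
...
>>> result = SomeResultsWithProperties()
>>> result.topic_1 = ["topic_1x", "topic_1y", "topic_1z"]
>>> print(result.topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
Note that the caller still uses plain dot notation; the property machinery is hidden inside the class, which is exactly why properties are considered the Pythonic compromise.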
The final code using a namedtuple would look something like the following:
>>> from collections import namedtuple
>>> SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'])
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']
>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']
>>>
The advantage of namedtuples is that they are immutable. This immutability is helpful because you want to combine the results of the topics once, at the end, not over and over.
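A quick demonstration, continuing from the namedtuple version of some_fcn() above (the exact wording of the error message varies across Python versions):
>>> result = some_fcn()
>>> result.topic_1 = ["replacement"]
Traceback (most recent call last):
  ...
AttributeError: can't set attribute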
The disadvantage of namedtuples is that one has to provide a value for each topic or explicitly default each value to None (SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'], defaults=(None, None, None))). This seems trivial. Unfortunately, this particular client had so many topics that it would be annoying to initially set all the topics to None.
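When there are many topics, the defaults can at least be generated instead of typed out by hand. A minimal sketch is shown below; note that it still redefines SomeResults, so it would not have satisfied this client either.
>>> from collections import namedtuple
>>> fields = ['topic_1', 'topic_2', 'topic_3']
>>> SomeResults = namedtuple('SomeResults', fields, defaults=(None,) * len(fields))
>>> SomeResults()
SomeResults(topic_1=None, topic_2=None, topic_3=None)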
If you want to brush up on namedtuples, consider reading the article "Write Pythonic and Clean Code With namedtuple" by Leodanis Pozo Ramos.
The final code using a dataclass would look something like the following:
>>> from dataclasses import dataclass
>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn())
SomeResults(topic_1=['topic_1x', 'topic_1y', 'topic_1z'], topic_2=['topic_2x', 'topic_2y', 'topic_2z'], topic_3=['topic_3x', 'topic_3y', 'topic_3z'])
The advantage of using a dataclass is that it is a "natural" fit: SomeResults is a class primarily used for storing data. Also, the decorator automatically generates boilerplate methods such as __init__() and __repr__().
The disadvantage of dataclasses is that there is no runtime data validation.
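For example, using a throwaway class, nothing stops a caller from supplying the wrong type; the annotations are hints, not checks.
>>> from dataclasses import dataclass
>>> @dataclass
... class Example:
...     values: list[str] | None
...
>>> Example(values=42)  # accepted silently: type hints are not enforced at runtime
Example(values=42)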
Also, the use of a decorator might seem to violate the constraint that the original declaration of SomeResults could not be modified. A strict interpretation is that the decorator does modify the original declaration. However, it is a local modification of the implementation for a specific purpose. As a side note, if you are ever in a situation where you cannot modify the code but still have to change its behavior, think decorators.
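As a rough sketch of that idea: a decorator is just a callable, so it can be applied after the fact to a class imported from the client's code, leaving the original declaration untouched. The redeclaration below merely stands in for that import.
>>> from dataclasses import dataclass
>>> class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> SomeResults = dataclass(SomeResults)  # decoration applied locally, as a plain function call
>>> SomeResults(topic_1=None, topic_2=None, topic_3=None)
SomeResults(topic_1=None, topic_2=None, topic_3=None)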
If you want to brush up on dataclasses, consider reading the article "Data Classes in Python 3.7+ (Guide)" by Geir Arne Hjelle.
The final code using Pydantic's BaseModel would look something like the following:
>>> from pydantic import BaseModel
>>> class SomeResults(BaseModel):
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn())
topic_1=['topic_1x', 'topic_1y', 'topic_1z'] topic_2=['topic_2x', 'topic_2y', 'topic_2z'] topic_3=['topic_3x', 'topic_3y', 'topic_3z']
This particular client was processing web pages from the internet, and so automatic runtime data validation was needed. This makes Pydantic a natural fit.
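For example, continuing from the BaseModel version above (Pydantic v2 is assumed here), data of the wrong type is rejected the moment the model is constructed.
>>> from pydantic import ValidationError
>>> try:
...     SomeResults(topic_1=123, topic_2=None, topic_3=None)
... except ValidationError:
...     print("rejected at runtime")
...
rejected at runtime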
A con is that Pydantic introduces an external dependency. The alternative is to write, debug, and maintain the equivalent validation code yourself, which is rarely realistic. On the other hand, bringing Pydantic into your tech stack makes a lot of useful functionality available, such as JSON conversion.
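For instance, with Pydantic v2 the result of some_fcn() can be serialized in a single call (in Pydantic v1 the equivalent method was .json()):
>>> print(some_fcn().model_dump_json())
{"topic_1":["topic_1x","topic_1y","topic_1z"],"topic_2":["topic_2x","topic_2y","topic_2z"],"topic_3":["topic_3x","topic_3y","topic_3z"]}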
Another con is the additional runtime overhead incurred by validation.
Also, notice that the original class SomeResults is modified to be a subclass of BaseModel. For this particular client, this is not just a con but a deal breaker. The original class SomeResults cannot be modified.
If you want to brush up on Pydantic, consider reading the article "Pydantic: Simplifying Data Validation in Python" by Harrison Hoffman.
The final code using Pydantic's dataclass decorator would look something like the following:
>>> from pydantic.dataclasses import dataclass
>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']
>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']
>>>
The Pydantic dataclass decorator satisfies all of the client's requirements: it supports runtime data validation, and no changes have been made to the original definition of the class SomeResults.
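To make that last point concrete, here is a minimal sketch. In real code SomeResults would be imported from the client's module; it is redeclared below only so the snippet stands on its own.
>>> from pydantic import ValidationError
>>> from pydantic.dataclasses import dataclass
>>> class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> SomeResults = dataclass(SomeResults)  # decorator applied as a plain function call
>>> try:
...     SomeResults(topic_1=123, topic_2=None, topic_3=None)
... except ValidationError:
...     print("rejected at runtime")
...
rejected at runtime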
As shown above, there are many ways to ensure that some function returns a specific type of output. The five approaches presented here are not an exhaustive listing of all the possibilities, but they are illustrative, and a casually read blog post imposes its own length constraints.
We started with the simplest approach, setting the object's attributes directly. We then used namedtuples. Unfortunately, this particular client could not use them because defaulting the many topics to None would have been impractical. This forced us to move on to dataclasses. However, this particular client also needed runtime data validation, and so Pydantic was brought in. That still did not meet the client's requirements, because subclassing BaseModel modifies the original class SomeResults. We then used Pydantic's dataclass decorator, which provides the validation without modifying the class SomeResults.