While working for a client, I was given code with the following structure.
There was a SomeResults class
>>> class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>>
There was some function that had to return SomeResults
>>> def some_fcn() -> SomeResults:
...     raise NotImplementedError()
...
>>>
My task was to create the implementation. There were clear instructions on what the implementation was to output, and guidance on how to get started was provided through personal interactions.
However, I was puzzled as to how to structure the code. Do I access object attributes directly? Do I use a namedtuple? Do I use a dataclass? Do I use Pydantic? We will explore each of these below with their associated pros and cons.
The applicable requirements and restrictions imposed by the client are not going to be stated upfront. This way, the reader is forced to work through the thought process behind each approach. Anyway, sometimes everyone needs a little mystery. Also, in real life, it is rare to have all the requirements and restrictions stated upfront.
"Optional[list[str]]" Should Replace "list[str] | None"

The Python documentation states, "if an explicit value of None is allowed, the use of Optional is appropriate, whether the argument is optional or not". Unfortunately, one of the constraints imposed by the client was that the original declaration of SomeResults could not be modified. Consequently, we have to use "list[str] | None".
If you are interested in the details, refer to
Approach 1: Access Object Attributes Directly

The final code would look something like the following:
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     result = SomeResults()
...     result.topic_1 = topic_a
...     result.topic_2 = topic_b
...     result.topic_3 = topic_c
...     return result
...
>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']
>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']
The computation of the topics is complex, so it would be done separately. This is mimicked by:
topic_a = ["topic_1x", "topic_1y", "topic_1z"]
topic_b = ["topic_2x", "topic_2y", "topic_2z"]
topic_c = ["topic_3x", "topic_3y", "topic_3z"]
Once the topics are computed, they are gathered together to create a result.
result = SomeResults()
result.topic_1 = topic_a
result.topic_2 = topic_b
result.topic_3 = topic_c
The advantage of the above approach is that it is quick: there is no manual implementation of methods like __init__().
The disadvantage of the above approach is that it sets individual attributes directly using dot notation. This is not considered Pythonic. An alternative approach is to use getter and setter methods; however, getters and setters are a legacy from C++, where they make library packaging practical and avoid recompiling the entire world, and they should be avoided in Python. Another approach is to use properties, which is considered Pythonic. These different approaches generate much controversy. Consequently, the following links will allow you to gather information and make a decision that is appropriate to your use case and time constraints.
Approach 2: namedtuple

The final code would look something like the following:
>>> from collections import namedtuple
>>> SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'])
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']
>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']
>>>
The advantage of namedtuples is that they are immutable. This immutability is helpful because you want to combine the results of the topics once, at the end, not over and over.
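The immutability is easy to verify (a small sketch reusing the same field names): reassigning a field raises AttributeError.

```python
from collections import namedtuple

SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'])
result = SomeResults(topic_1=["topic_1x"], topic_2=None, topic_3=None)

try:
    result.topic_1 = []   # namedtuple fields cannot be reassigned
except AttributeError:
    print("fields are read-only")
```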
The disadvantage of namedtuples is that one has to provide a value for each topic or explicitly default each value to None (SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'], defaults=(None, None, None))). This seems trivial. Unfortunately, this particular client had so many topics that initially setting all of them to None would be annoying.
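To make the annoyance concrete, the defaults can at least be generated rather than typed out by hand. A sketch with invented field names (the real client had far more topics):

```python
from collections import namedtuple

# Hypothetical: many topics. Generating the field names and a matching
# tuple of None defaults avoids writing each default manually.
fields = [f"topic_{i}" for i in range(1, 11)]
ManyResults = namedtuple('ManyResults', fields, defaults=(None,) * len(fields))

result = ManyResults(topic_1=["topic_1x"])
print(result.topic_1)    # ['topic_1x']
print(result.topic_10)   # None
```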
If you want to brush up on namedtuples, consider reading the article "Write Pythonic and Clean Code With namedtuple" by Leodanis Pozo Ramos.
Approach 3: dataclass

The final code would look something like the following:
>>> from dataclasses import dataclass
>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn())
SomeResults(topic_1=['topic_1x', 'topic_1y', 'topic_1z'], topic_2=['topic_2x', 'topic_2y', 'topic_2z'], topic_3=['topic_3x', 'topic_3y', 'topic_3z'])
The advantage of using a dataclass is that it is a "natural" fit because SomeResults is a class primarily used for storing data. Also, it automatically generates boilerplate methods.
The disadvantage of dataclasses is that there is no runtime data validation.
Also, the use of a decorator might seem to violate the constraint that the original declaration of SomeResults could not be modified. The strict interpretation is that we have modified the original declaration through the use of a decorator. However, it is a local modification of the implementation for a specific purpose. As a side note, if you are ever in a situation where you can't modify the code but at the same time you have to modify the code, think decorators.
If you want to brush up on dataclasses, consider using the article "Data Classes in Python 3.7+ (Guide)" by Geir Arne Hjelle.
Approach 4: Pydantic BaseModel

The final code would look something like the following:
>>> from pydantic import BaseModel
>>> class SomeResults(BaseModel):
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn())
topic_1=['topic_1x', 'topic_1y', 'topic_1z'] topic_2=['topic_2x', 'topic_2y', 'topic_2z'] topic_3=['topic_3x', 'topic_3y', 'topic_3z']
This particular client was processing web pages from the internet, and so automatic runtime data validation was needed. This makes Pydantic a natural fit.
A con is that Pydantic introduces an external dependency. The alternative is to write, debug, and maintain the equivalent code for this particular use case yourself, which is rarely realistic. On the other hand, by introducing Pydantic to your tech stack, a lot of useful functionality becomes available, like JSON conversion.
Another con is the extra runtime overhead introduced by the validation.
Also, notice that the original class SomeResults is modified to be a subclass of BaseModel. For this particular client, this is not just a con but a deal breaker. The original class SomeResults cannot be modified.
If you want to brush up on Pydantic, consider reading the article "Pydantic: Simplifying Data Validation in Python" by Harrison Hoffman.
Approach 5: Pydantic dataclass

The final code would look something like the following:
>>> from pydantic.dataclasses import dataclass
>>> @dataclass
... class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...
>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']
>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']
>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']
>>>
The Pydantic dataclass decorator satisfies all of the client's requirements. It supports runtime data validation, and no changes are made to the original definition of the class SomeResults.
As shown above, there are many ways to ensure that a function returns a specific type of output. The five approaches are by no means a thorough listing of all the possibilities, but they are illustrative, and there are length constraints imposed by people casually reading blog posts.
We started with Approach 1, which is the simplest. We then used namedtuples; unfortunately, this particular client had so many topics that defaulting each value to None would have been annoying. This forced us to move on to dataclasses. However, this particular client needed runtime data validation, and so Pydantic was needed. We still did not meet the client's requirements, because we had modified the original class SomeResults. We then used Pydantic's dataclass decorator so that we did not have to modify the class SomeResults.
Comments
Excellent write-up as always! I love Pydantic but never messed with the decorators so that is a neat thing to be aware of. My initial impression here is that yet another way that this could be potentially done would be to have a `Dict[str, List[str | None]]` attribute keyed on topics, however that really only makes sense if topics are dynamic data. I guess it all depends on the data model.
>Excellent write-up as always!
Thanks
>I love Pydantic but never messed with the decorators so that is a neat thing to be aware of
I too was not aware of it until recently. Luckily, I had a good friend who made me aware of it via a private email. Also, several other people commented that they were not aware of Pydantic's dataclass.
>... here is ... yet another way ... Dict[str, List[str | None]]
Thanks for proposing an alternative data structure
>I guess it all depends on the data model
Yeah - Creating the data model / data structure for a particular use case is tough
>... it all depends
:-)
Great article, Robert! I have not had much opportunity to examine this library, but it seems very powerful. Agree that the obsession of doing everything inside Python native does not make sense.
I have had mixed success with 'reflection' solutions since they depend on internal knowledge and immutable internal structure. Will take a look at this library for sure.
>Great article, Robert!
Thanks
>Agree that the obsession of doing everything inside Python native does not make sense
Yeah - I think that is one of the reasons that Python became so popular - There is a basic, simple core Python language - Then you have an entire ecosystem from which to select a solution that works for your particular use case
Nice description. Good examples and thanks for sharing. Personally, I don't find option 1 horrifying, so I'm not convinced that it's an antipattern. And with other libraries like [typeguard](https://pypi.org/project/typeguard/), you might get runtime validation. And if you are able to re-define SomeResults, you can easily set `None` as the unset default.
I'm also reluctant to accept the use of Pydantic. It's more than just a dependency. It's a _non-pure_ dependency, meaning it will only work in environments where extension modules can be compiled and installed. Granted, that's most environments, but it excludes use-cases such as vendored libraries, cygwin, web-based interpreters, and embedded environments. For example, I tried to use pydantic in inflect, but found that [non viable](https://github.com/jaraco/inflect/issues/195). Pydantic is wonderful and I love what it does, but be considerate about the constraints it creates when adopting it.
>Nice description. Good examples
Thanks
>Personally, I don't find option 1 horrifying, so I'm not convinced that it's an antipattern
Yeah - I made the mistake of using too strong a wording. I have added an entire paragraph at the end of approach 1 which starts with "The disadvantage of the above approach is that it sets individual attributes directly using dot notation. This is not considered Pythonic." - There are links afterwards which will allow people to explore the topic for themselves and make their own choice.
>typeguard
typeguard: 1.7k stars on GitHub
Pydantic: 24.7k stars on GitHub
>... tried to use pydantic in inflect, but found that [non viable](https://github.com/jaraco/inflect/issues/195)
>Pydantic is wonderful and I love what it does
>but be considerate about the constraints it creates when adopting it
Fair enough
I personally like native classes and try to think twice before moving away from them. However, nowadays it's unlikely that you are working with native Python code only. You are probably already using Pydantic (or an alternative) in the project, so using dataclass from Pydantic (or even BaseModel, if you have a somewhat more complex use case) is a good practical choice.
Python has unfortunately drunk the Dependency Hell Kool-Aid and now has to lie in this bed. Even if you brush aside the security implications, it is aggravating that even simple projects now require multiple gigabytes of dependencies to be acquired, which sucks especially badly if you are working on a highly modular application. Won't someone think of the SSD wear leveling?! :)
>Python has unfortunately drunk the Dependency Hell Kool-Aid
>Even if you brush aside the security implications
What is the alternative?
Write your own?
You can still elect to do that if you wish
>require multiple gigabytes of dependencies
Interesting
When you get a chance, please provide some examples
I can't remember even one time when this happened to me
>SSD wear leveling
At least the SSD vendors will be happy :-)
Thanks for sharing the multiple Pythonic approaches to achieve one thing. My preference, given Python is a glue tool, is actually approach 1 as a starting point -- it is native Python, clean, and easy to read and maintain. If one needs to make it a "pattern", there are tricks such as getters and setters; explicit type validation is not too ugly to add; and if such needs repeat in many places, then consider using decorators, for example.
(If one needs fast running time, and the implementation here is profiled to be the bottleneck, maybe consider using C with Python bindings, or another language directly.)
>My preference ... is actually approach 1 as a starting point
Notice the use of the phrase "as a starting point"
I have added an entire paragraph at the end of approach 1 which starts with "The disadvantage of the above approach is that it sets individual attributes directly using dot notation. This is not considered Pythonic." - There are links afterwards which will allow people to explore the topic for themselves and make their own choice.
>getters and setters
The last paragraph of approach 1 states: An alternative approach is to use getter and setter methods. Getters and setters are a legacy from C++ that make library packaging practical and avoid recompiling the entire world. They should be avoided in Python.
>if one needs fast running time and the implementation here is profiled to be the bottleneck, maybe consider using ...
Thanks for the suggestions
Unfortunately, I cannot include ways to address the various cons. It would make the blog post too long. People expect blog posts to be a quick read, and if they are not, they will move on to others that are.
Can you elaborate on how technique 5 does not make any changes to the original definition of SomeResults? It seems to me that you are re-defining it (though I _think_ it has the same definition? Is that acceptable?)
Another question: is there a significant semantic difference between a None value and an empty list? I'm getting the impression this is attempting to get closer to "make invalid states unrepresentable", but I'm not sure. Either way, it appears that the original definition has that, so.
Oh, rereading, I see that you address this (though I didn't realize it at first): "As a side note, if you are ever in a situation where you can't modify the code but at the same time you have to modify the code, think decorators."
It is curious to me that decorating something isn't the same as modifying it.
Are there some API properties between components here that must be maintained, and "can't modify SomeResults" is just a shorthand for saying so? Is there some reason why duck type polymorphism wouldn't work here?
>is there a significant semantic difference between a None value and an empty list?
Hmm
There are multiple ways to look at this
Approach 1: Data Types
>>> type(None)
<class 'NoneType'>
>>> type([])
<class 'list'>
Approach 2: Truthiness
>>> if None: print("Hello")
...
>>> if not None: print("Hello")
...
Hello
>>> if []: print("Hello")
...
>>> if not []: print("Hello")
...
Hello
Approach 3: Equivalent to asking why use NULL in a database
In databases, NULL is a special marker used to indicate that a data value is unknown, missing, or not applicable for a particular column in a row
It is distinct from an empty string, a zero, or any other defined value
Approach 4: Go with what the client wants
In this particular use case, the client wanted an explicit None, so we went with that
There was no time to engage the client in a conversation about this
>It is curious to me that decorating something isn't the same as modifying it
I received many comments about this
Strictly speaking, decorating something is the same as modifying it
I tried to explain the non-strict perspective by adding the following paragraph to approach 3: Also, the use of a decorator might seem to violate the constraint that the original declaration of SomeResults could not be modified. The strict interpretation is ...
>Are there some API properties between components here that must be maintained, and "can't modify SomeResults" is just a shorthand for saying so?
Yes
>Is there some reason why duck type polymorphism wouldn't work here?
Hmm
I read the words, but don't understand
Could you provide a working code snippet demonstrating the idea?
The original code had both "list[str] | None" and "Optional[List[str]]". I received comments about this.
The blog post has been updated by adding the section titled "Optional[list[str]]" Should Replace "list[str] | None". Also, all the code snippets have been updated to use "list[str] | None".
Starting with Python 3.10, unions can be written as X | Y, per PEP 604, titled "Allow writing union types as X | Y". On top of that, the current documentation states, "To define a union, use e.g. Union[int, str] or the shorthand int | str. Using that shorthand is recommended."
However, the twist for this particular use case is that an explicit value of None is allowed. So, we should follow the official documentation, which states, "if an explicit value of None is allowed, the use of Optional is appropriate, whether the argument is optional or not".
The double twist for this particular use case is that the constraint imposed by the client was that the original declaration of SomeResults could not be modified. So, back to using "list[str] | None".
This is a great primer! Extremely helpful for anyone choosing between data containers in Python.
I appreciate how you balance practical implementation with conceptual explanation while walking us through five different approaches.