"Constants" are variables whose values are intended to remain unchanged throughout the program's execution. Unlike some other programming languages, Python does not have a built-in mechanism to enforce immutability for constants. However, there are approaches that can be used to indicate that a variable is a constant.
One approach is to rely on a widely adopted naming convention. A second approach is to use typing.Final. A third approach is to use slots. The rest of this blog will address each approach in turn.
Approach 1: Naming Convention Using All Capital Letters
The disadvantage is that it requires developers to remember to follow the naming convention and reviewers to remember to check for it. On top of that, there is no static or runtime checking.
The advantage is that no additional code beyond declaring the "constant" is needed.
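As a minimal sketch (the names here are made up for illustration), the convention looks like this:

# "Constants" signaled purely by the all-capital naming convention
MAX_RETRIES = 3
DEFAULT_TIMEOUT_SECONDS = 30

# Nothing prevents this reassignment; no tool will flag it
MAX_RETRIES = 5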
Approach 2: Use typing.Final
Let's demonstrate how typing.Final can detect "constants" being modified via a code snippet. Create a file with the name constants.py with the following contents:
from typing import Final
CONSTANT_A: Final = "A"
CONSTANT_B: Final = "B"
CONSTANT_A = "Z" # Line 6
When run "mypy constants.py", the following output is generated:
constants.py:6: error: Cannot assign to final name "CONSTANT_A" [misc]
Found 1 error in 1 file (checked 1 source file)
A disadvantage is that typing.Final produces no runtime errors. The following interactive session runs without complaint:
>>> from typing import Final
>>> CONSTANT_A: Final = "A"
>>> CONSTANT_B: Final = "B"
>>> CONSTANT_A = "Z" # Line 6
>>>
However, if a static type checker is used, the code won't pass basic QA and consequently will never make it to production.
The advantage of this approach is that, combined with a static type checker, it guarantees that "constants" are not modified.
Approach 3: Slots
Let's demonstrate how slots can be used to create "constants" via a working code snippet.
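A possible sketch uses an empty __slots__ so that instances cannot acquire or rebind attributes, with the "constants" stored as class attributes (the class and attribute names are made up for illustration):

class _Constants:
    # Empty __slots__ means instances have no __dict__, so attempts to
    # rebind CONSTANT_A or CONSTANT_B on an instance raise AttributeError
    __slots__ = ()
    CONSTANT_A = "A"
    CONSTANT_B = "B"

CONSTANTS = _Constants()
print(CONSTANTS.CONSTANT_A)   # A
CONSTANTS.CONSTANT_A = "Z"    # AttributeError at runtime

Note that this only blocks reassignment through the instance; assigning to _Constants.CONSTANT_A on the class itself would still succeed.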
We have discussed three different approaches. I personally like approach 2, where typing.Final is used. It is simple to implement, and together with a static type checker it guarantees that a "constant variable" is never changed.
My task was to create the implementation. There were clear instructions on what the implementation was to output, and initial instructions were provided on how to get started.
However, I was puzzled as to how to structure the code. Do I access object attributes directly? Do I use a namedtuple? Do I use a dataclass? Do I use Pydantic? We will explore each of these below with their associated pros and cons.
The applicable requirements and restrictions imposed by the client are not going to be stated upfront. This way, the reader is forced to work through the thought process and approaches presented here. Anyway, sometimes everyone needs a little mystery. Also, in real life, it is rare to have all the requirements and restrictions stated upfront.
"Optional[list[str]]" Should Replace "list[str] | None"
The advantage of accessing object attributes directly is that it is quick: there is no manual implementation of methods like __init__().
The disadvantage is that individual attributes are set directly using dot notation, which is not considered Pythonic. An alternative is to use getter and setter methods. Getters and setters are a legacy pattern from C++, where they make library packaging practical and avoid recompiling the entire world; they should be avoided in Python. Another alternative is to use properties, which is considered Pythonic. These different approaches generate much controversy. Consequently, the following links will allow you to gather information that can then be used to make a decision appropriate to your use case and time constraints.
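For concreteness, here is a hedged sketch of the two styles (the class and attribute names are illustrative, not the client's actual code):

# Style 1: set attributes directly using dot notation
class SomeResults:
    pass

results = SomeResults()
results.topic_1 = "result for topic 1"

# Style 2: mediate access through a property
class SomeResultsWithProperty:
    def __init__(self):
        self._topic_1 = None

    @property
    def topic_1(self):
        return self._topic_1

    @topic_1.setter
    def topic_1(self, value):
        self._topic_1 = value

results = SomeResultsWithProperty()
results.topic_1 = "result for topic 1"   # same syntax, but routed through the property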
The advantage of namedtuples is that they are immutable. This immutability is helpful because the results for each of the topics should be combined once at the end, not over and over.
The disadvantage of namedtuples is that one has to provide a value for each topic or explicitly default each value to None (SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'], defaults=(None, None, None))). This seems trivial. Unfortunately, this client had so many topics that manually defaulting all to None was impractical.
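A small sketch of both points (the field values are made up):

from collections import namedtuple

SomeResults = namedtuple(
    "SomeResults", ["topic_1", "topic_2", "topic_3"], defaults=(None, None, None)
)

results = SomeResults(topic_1="result for topic 1")
print(results.topic_2)        # None, supplied by defaults
results.topic_1 = "changed"   # AttributeError: can't set attribute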
The advantage of using a dataclass is that it is a "natural" fit because SomeResults is a class primarily used for storing data. Also, it automatically generates boilerplate methods.
The disadvantage of dataclasses is that there is no runtime data validation.
Also, the use of a decorator might seem to violate the constraint that the original declaration of SomeResults could not be modified. The strict interpretation is that we have modified the original declaration through the use of a decorator. However, it is a local modification of the implementation for a specific purpose. As a side note, if you are ever in a situation where you can't modify the code but at the same time you have to modify the code, think decorators.
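A hedged sketch of this idea, assuming SomeResults is annotated with optional fields defaulting to None (the real client class is not shown here):

from dataclasses import dataclass
from typing import Optional

# Assumed shape of the original declaration
class SomeResults:
    topic_1: Optional[str] = None
    topic_2: Optional[str] = None
    topic_3: Optional[str] = None

# Apply the decorator as an ordinary function call; the source of the original
# declaration is untouched, although the class object itself is augmented
SomeResultsDataclass = dataclass(SomeResults)

results = SomeResultsDataclass(topic_1="result for topic 1")
print(results)  # auto-generated __repr__ showing all three topics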
This particular client was processing web pages from the internet, and so automatic runtime data validation was needed. This makes Pydantic a natural fit.
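A hedged sketch of the Pydantic version (the field names and types are assumptions):

from typing import Optional
from pydantic import BaseModel

# The original class now has to inherit from BaseModel
class SomeResults(BaseModel):
    topic_1: Optional[str] = None
    topic_2: Optional[str] = None
    topic_3: Optional[str] = None

SomeResults(topic_1="result for topic 1")    # passes validation
SomeResults(topic_1=["not", "a", "string"])  # raises a ValidationError at runtime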
A con would be that Pydantic introduces an external dependency. The alternative is to write, debug, and maintain the equivalent code for this particular use case yourself. Not sure how realistic that would be. Also, by introducing Pydantic to your tech stack, a lot of useful functionality becomes available like JSON conversions.
Another con is the additional runtime overhead introduced by validation.
Also, notice that the original class SomeResults is modified to be a subclass of BaseModel. For this particular client, this is not just a con but a deal breaker. The original class SomeResults cannot be modified.
The Pydantic dataclass decorator satisfies all of the client's requirements. It supports runtime data validation, and no changes have been made to the original definition of the class SomeResults.
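A hedged sketch of that approach (again assuming an annotated SomeResults with None defaults):

from typing import Optional
from pydantic.dataclasses import dataclass as pydantic_dataclass

# The original declaration, left untouched
class SomeResults:
    topic_1: Optional[str] = None
    topic_2: Optional[str] = None
    topic_3: Optional[str] = None

# Wrap the existing class instead of editing its definition
ValidatedSomeResults = pydantic_dataclass(SomeResults)

ValidatedSomeResults(topic_1="result for topic 1")    # passes validation
ValidatedSomeResults(topic_1=["not", "a", "string"])  # raises a validation error at runtime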
Summary
As shown above, there are many ways to ensure that a function returns a specific type of output. The five approaches are not an exhaustive list of all possible approaches. However, they are illustrative, and there are length constraints imposed by people casually reading blog posts.
We started with "Approach 1," which is the simplest. We then used namedtuples. Unfortunately, this particular client could not use them because they needed the ability to default values to None. This forced us to move on to dataclasses. However, this particular client needed runtime data validation and so Pydantic was needed. We still did not meet the client's requirements because we modified the original class SomeResults. We then used Pydantic's dataclass decorator so that we did not have to modify the class SomeResults.
One of the common tasks that I have to do is identify nouns and verbs in text. If I am doing object oriented programming, the nouns will help in deciding which classes to create. Even if I am not doing object oriented programming, having a list of nouns and verbs will facilitate understanding.
Python Advanced Concepts
Some of the Python code used might be considered "advanced".
For example, we will use a list of tuples. In my opinion, going from ["txt_1", "txt_2"] to [("txt_1a", "txt_1b"), ("txt_2a", "txt_2b")] is no big deal. Refer to output 2 for how it applies in this write-up.
Also, Counter() is used to count the number of times that a particular tuple occurs. Refer to output 3 for how it applies in this write-up.
The contents of Counter() is a dictionary whose keys are tuples and whose values are integer counts. Refer to output 3 for how it applies in this write-up.
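A quick illustrative example (the tuples below are made up, not actual output from this post):

from collections import Counter

pairs = [("leaves", "NNS"), ("green", "JJ"), ("leaves", "NNS")]
counts = Counter(pairs)
print(counts)                     # Counter({('leaves', 'NNS'): 2, ('green', 'JJ'): 1})
print(counts[("leaves", "NNS")])  # 2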
Natural Language Toolkit (NLTK)
NLTK is used for tokenization and then assigning parts of speech (POS) to those tokens. Unfortunately, NLTK is large and complex. The reference section at the end provides links and short text snippets for the applicable NLTK components.
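As a setup note, the tokenizer model and POS tagger used below typically have to be downloaded once; depending on your NLTK release, the commands look roughly like this (the exact package identifiers can vary between versions):

import nltk

# One-time downloads of the sentence/word tokenizer model and the POS tagger
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")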
To go to the file containing the working code, click here.
The structure of the code is as follows:
Create text data
Process the text to tokenize it, assign a POS to it, and then count the combination of tokens and POS.
Filter data such that only nouns and verbs remain.
Sort data such that the output has the desired format.
Create Text Data
The test data is created as follows:
Code Snippet 1
TEXT_1 = """
This is chunk one of the text.
The leaves on a tree are green.
The leaves on a tree aren't red.
"""
TEXT_2 = """
This is chunk two of the text.
The color of the car is red.
The color of the car also contains blue.
The color of the car isn't green.
"""
LIST_OF_STRINGS = [TEXT_1, TEXT_2]
TEXT_1 and TEXT_2 are easily modifiable, so the reader can substitute their own test data. Also, it is easy to expand the data with TEXT_3, TEXT_4, and so on, because the processing is based on the list of text contained in LIST_OF_STRINGS.
Process Text: tokenize, assign POS to tokens, count combinations of token and POS
The processing of text is accomplished with the following code snippet.
Code Snippet 2
from collections import Counter

import nltk

# For the combination of a word and its part of speech,
# create a count of its associated appearances
for i, string in enumerate(LIST_OF_STRINGS):
    # Decompose the string into tokens
    string_tokenized = nltk.word_tokenize(string)
    # Assign a part of speech to each token
    string_pos_tag = nltk.pos_tag(string_tokenized)
    # Count the (token, part of speech) combinations
    if i == 0:
        count_token_pos = Counter(string_pos_tag)
    else:
        count_token_pos.update(Counter(string_pos_tag))
Notice that the contents of Counter() is a dictionary whose keys are tuples and whose values are integers. Each tuple key consists of a token and a POS, as shown in output 2. The integer is simply the number of times that the tuple occurs.
Filter Data for Nouns & Verbs
The first step is to identify the POS that correspond to nouns and verbs.
This is implemented via the following code snippet.
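A sketch of that snippet, assuming the Penn Treebank tagset that nltk.pos_tag uses for English (the exact tags included here are an assumption):

Code Snippet 3

# Penn Treebank tags for nouns and verbs
NOUN_POS = ("NN", "NNS", "NNP", "NNPS")
VERB_POS = ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ")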
The filtering is implemented via the following code snippet.
Code Snippet 4
list_noun_count = []
list_verb_count = []
for token_pos in count_token_pos:
    if token_pos[1] in NOUN_POS:
        list_noun_count.append((token_pos[0], count_token_pos[token_pos]))
    elif token_pos[1] in VERB_POS:
        list_verb_count.append((token_pos[0], count_token_pos[token_pos]))
Notice that the output is a list of tuples. Also, each tuple consists of a string and an integer. This will be important when the lists are sorted.
Sort Data & Generate Output
One set of output will be nouns and their associated counts. This needs to be sorted either alphabetically by noun or by count; otherwise, the data will be in an arbitrary order. Similarly, another set of output will be verbs and their counts, which will also need to be sorted.
This is implemented via the following code snippet.
Code Snippet 5
# Sort data alphabetically
list_noun_count.sort()
list_verb_count.sort()
with open("noun_counts_alphabetized_by_noun.txt", "w", encoding="utf8") as file:
    for noun_count in list_noun_count:
        file.write(noun_count[0] + ", " + str(noun_count[1]) + "\n")
with open("verb_counts_alphabetized_by_verb.txt", "w", encoding="utf8") as file:
    for verb_count in list_verb_count:
        file.write(verb_count[0] + ", " + str(verb_count[1]) + "\n")
# Sort data by their counts
list_noun_count.sort(key=lambda noun_count: noun_count[1], reverse=True)
list_verb_count.sort(key=lambda verb_count: verb_count[1], reverse=True)
with open("noun_counts_by_increasing_count.txt", "w", encoding="utf8") as file:
    for noun_count in list_noun_count:
        file.write(noun_count[0] + ", " + str(noun_count[1]) + "\n")
with open("verb_counts_by_increasing_count.txt", "w", encoding="utf8") as file:
    for verb_count in list_verb_count:
        file.write(verb_count[0] + ", " + str(verb_count[1]) + "\n")
The second thing to note is that we specified a key for sorting by count via lambda noun_count: noun_count[1]. Recall from Output 4.A and 4.B that we are sorting a list of tuples. If you need a refresher on sorting, consider reading "How to Use sorted() and .sort() in Python" by David Fundakowski.
Lastly, please make a habit of specifying utf8 as the encoding when creating a file. When no encoding is specified, Python uses the OS default [locale.getencoding()]. This is especially important on the Windows operating system, which defaults to cp1252.
References
The NLTK corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.
Individual packages can be downloaded by calling the download() function with a single argument, giving the package identifier for the package that should be downloaded:
This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.
A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):
>>> tagged_tok = ('fly', 'NN')
An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
The NLTK data package includes a pre-trained Punkt tokenizer for English.
Generating synthetic data has been a topic of interest since the early days of conventional databases. Recently, it has become a hot topic because of the large language model (LLM) phenomenon. Basically, everyone has scraped the internet, and now more data is needed. Also, higher quality data is needed. For further details, refer to the video "Jonathan Ross, Founder & CEO @ Groq: NVIDIA vs Groq - The Future of Training vs Inference | E1260" at 20VC with Harry Stebbings. Start listening 2 minutes in and then just listen for 3 minutes.
This blog post contains a list of references which hopefully will save others some leg work.
If you're looking for a Python library to perform differential privacy computations, diffprivlib seems to be an attractive choice. You'll find it prominently featured in Google search results. It's maintained by IBM, and extensively cited in the scientific literature. Its README states that you can use it to "build your own differential privacy applications", and it's regularly updated. Last but not least, it's very easy to pick up: its API mimics well-known tools like NumPy or scikit-learn, making it look simple and familiar to data scientists.
Unfortunately, diffprivlib is flawed in a number of important ways. I think most people should avoid using it. This blog post lists a few reasons why.