Wednesday, July 23, 2025

Pydantic or Dataclass or Namedtuple or Just a Class with Attributes

 

Introduction

While working for a client, I was given code with the following structure.

There was a SomeResults class

>>> class SomeResults:
...     topic_1: list[str] | None
...     topic_2: list[str] | None
...     topic_3: list[str] | None
...
>>>

There was also a function that had to return SomeResults

>>> def some_fcn() -> SomeResults:
...     raise NotImplementedError()
...
>>>

My task was to create the implementation. There were clear instructions on what the implementation was to output, and guidance on how to get started was provided in person.

However, I was puzzled as to how to structure the code. Do I access object attributes directly? Do I use a namedtuple? Do I use a dataclass? Do I use Pydantic? We will explore each of these below with their associated pros and cons.

Approach 1: Access Object's Attributes Directly

Let me begin by stating that approach 1 should never be used. It is an anti-pattern.

Approach 1 is included for completeness.

The final code would look something like the following

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     result = SomeResults()
...     result.topic_1 = topic_a
...     result.topic_2 = topic_b
...     result.topic_3 = topic_c
...     return result
...

>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']

>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']

>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']

In the real code, computing the topics is complex and so it is done separately. That work is mimicked by

topic_a = ["topic_1x", "topic_1y", "topic_1z"]

topic_b = ["topic_2x", "topic_2y", "topic_2z"]

topic_c = ["topic_3x", "topic_3y", "topic_3z"]

Once the topics are computed, they are gathered together to create some result.

result = SomeResults()

result.topic_1 = topic_a
result.topic_2 = topic_b
result.topic_3 = topic_c

The advantage of the above approach is that it is quick. There is no need to hand-write methods like __init__().

The disadvantage of the above approach is that nothing guarantees every attribute gets set. The class body only annotates the attributes, so reading one that was never assigned raises AttributeError. Also, there is no __init__() constructor, which readers of a Python class universally expect.
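To make the risk concrete, here is a minimal sketch that reuses the annotation-only class from the introduction and deliberately forgets to assign two of the topics. The variable missing_attribute_message is a name made up for this illustration.

```python
from __future__ import annotations


class SomeResults:
    topic_1: list[str] | None
    topic_2: list[str] | None
    topic_3: list[str] | None


result = SomeResults()
result.topic_1 = ["topic_1x"]  # topic_2 and topic_3 are forgotten on purpose

# Annotations alone do not create attributes, so the omission only
# surfaces later, when something tries to read the missing attribute.
try:
    result.topic_2
except AttributeError as error:
    missing_attribute_message = str(error)

print(missing_attribute_message)
```

The failure shows up far from the place where the assignment was forgotten, which is what makes this style hard to maintain.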

Let me wrap-up approach 1 by repeating that it should never be used. It is an anti-pattern.

Approach 2: Use Namedtuple

The final code would look something like the following

>>> from collections import namedtuple

>>> SomeResults = namedtuple('SomeResults', ['topic_1', 'topic_2', 'topic_3'])

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...

>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']

>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']

>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']

>>>

The advantage of namedtuples is that they are immutable. Immutability is helpful here because you want to combine the results of each of the topics once, at the end. You don't want to combine the results of each of the topics over and over.

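The disadvantage of namedtuples is that their fields do not default to None on their own. Every value has to be passed explicitly unless a defaults list is supplied when the namedtuple is created (the defaults argument, available since Python 3.7). This may seem trivial because you can just set each value to None. Unfortunately, this particular client had so many topics that it would be annoying to initially set them all to None.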
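For what it is worth, collections.namedtuple accepts a defaults argument since Python 3.7, which may soften the drawback when there are many fields. The sketch below is only an illustration; the FIELDS list and the variable partial are names invented here.

```python
from collections import namedtuple

FIELDS = ["topic_1", "topic_2", "topic_3"]

# defaults is matched against the rightmost fields, so supplying one None
# per field makes every field optional.
SomeResults = namedtuple("SomeResults", FIELDS, defaults=[None] * len(FIELDS))

partial = SomeResults(topic_1=["topic_1x"])
print(partial)
# SomeResults(topic_1=['topic_1x'], topic_2=None, topic_3=None)

# Immutability: reassigning a field is rejected.
try:
    partial.topic_2 = ["topic_2x"]
except AttributeError:
    print("namedtuple fields cannot be reassigned")
```

Note that the tuple itself is immutable, but a list stored inside it can still be mutated in place.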

If you want to brush up on namedtuples, consider using the article "Write Pythonic and Clean Code With namedtuple" by Leodanis Pozo Ramos.

Approach 3: Use Dataclass

The final code would look something like the following

>>> from dataclasses import dataclass

>>> from typing import List, Optional

>>> @dataclass
... class SomeResults:
...     topic_1: Optional[List[str]]
...     topic_2: Optional[List[str]]
...     topic_3: Optional[List[str]]
...

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...

>>> print(some_fcn())
SomeResults(topic_1=['topic_1x', 'topic_1y', 'topic_1z'], topic_2=['topic_2x', 'topic_2y', 'topic_2z'], topic_3=['topic_3x', 'topic_3y', 'topic_3z'])

The advantage of using a dataclass is that it is a "natural" fit because SomeResults is a class primarily used for storing data. Also, it automatically generates boilerplate methods.

The disadvantage of dataclasses is that there is no runtime data validation. The type hints are not enforced when an object is created.
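A short sketch of what the lack of validation means in practice. The variable suspect and the bad values are made up for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SomeResults:
    topic_1: Optional[List[str]]
    topic_2: Optional[List[str]]
    topic_3: Optional[List[str]]


# The type hints are documentation only: a string and an integer are
# accepted silently where a list of strings (or None) was intended.
suspect = SomeResults(topic_1="not a list", topic_2=42, topic_3=None)
print(suspect)
# SomeResults(topic_1='not a list', topic_2=42, topic_3=None)
```

A static type checker could flag the call, but nothing stops it at runtime.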

If you want to brush up on dataclasses, consider using the article "Data Classes in Python 3.7+ (Guide)" by Geir Arne Hjelle.

Approach 4: Use Pydantic

The final code would look something like the following

>>> from pydantic import BaseModel

>>> from typing import List, Optional

>>> class SomeResults(BaseModel):
...     topic_1: Optional[List[str]]
...     topic_2: Optional[List[str]]
...     topic_3: Optional[List[str]]
...

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...

>>> print(some_fcn())
topic_1=['topic_1x', 'topic_1y', 'topic_1z'] topic_2=['topic_2x', 'topic_2y', 'topic_2z'] topic_3=['topic_3x', 'topic_3y', 'topic_3z']

This particular client was processing web pages from the internet and so automatic runtime data validation was needed. This makes Pydantic a natural fit.

A silly con to state would be that Pydantic introduces an external dependency. Actually, it is beyond silly because the point of Python is to have an ecosystem to provide choices. It is not realistic to build systems using just Python's Standard Library.

However, a real con is that there is a higher overhead that arises from the validation.

Also, notice that the original class SomeResults is modified to be a subclass of BaseModel. For this particular client, this is not just a con but a deal breaker. The original class SomeResults cannot be modified.

If you want to brush up on Pydantic, consider using the article "Pydantic: Simplifying Data Validation in Python" by Harrison Hoffman.

Approach 5: Use Pydantic dataclass Decorator

The final code would look something like the following

>>> from pydantic.dataclasses import dataclass

>>> from typing import List, Optional

>>> @dataclass
... class SomeResults:
...     topic_1: Optional[List[str]] = None
...     topic_2: Optional[List[str]] = None
...     topic_3: Optional[List[str]] = None
...

>>> def some_fcn() -> SomeResults:
...     topic_a = ["topic_1x", "topic_1y", "topic_1z"]
...     topic_b = ["topic_2x", "topic_2y", "topic_2z"]
...     topic_c = ["topic_3x", "topic_3y", "topic_3z"]
...     return SomeResults(
...         topic_1=topic_a,
...         topic_2=topic_b,
...         topic_3=topic_c
...     )
...

>>> print(some_fcn().topic_1)
['topic_1x', 'topic_1y', 'topic_1z']

>>> print(some_fcn().topic_2)
['topic_2x', 'topic_2y', 'topic_2z']

>>> print(some_fcn().topic_3)
['topic_3x', 'topic_3y', 'topic_3z']

>>>

The Pydantic dataclass decorator satisfies all the requirements of the client. It supports runtime data validation and lets the topics default to None. Also, SomeResults stays a plain class: it does not have to become a subclass of BaseModel.
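The sketch below shows the two behaviours that mattered for this client: defaulting to None and rejecting bad input. It assumes Pydantic is installed; the variables partial and rejected are names invented for the illustration.

```python
from typing import List, Optional

from pydantic import ValidationError
from pydantic.dataclasses import dataclass


@dataclass
class SomeResults:
    topic_1: Optional[List[str]] = None
    topic_2: Optional[List[str]] = None
    topic_3: Optional[List[str]] = None


# Topics that are not supplied fall back to None.
partial = SomeResults(topic_1=["topic_1x"])

# A bare string is not a list of strings, so construction is rejected.
try:
    SomeResults(topic_1="not a list")
    rejected = False
except ValidationError:
    rejected = True

print(partial.topic_2, rejected)
```

The same call with the standard library dataclass decorator would have been accepted without complaint.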

Summary

As shown above, there are many ways to ensure that a function returns a specific type of output. The 5 approaches are not an exhaustive list. However, they are illustrative, and a blog post that people read casually can only be so long.

We started out with "Approach 1" which is the simplest. We then used namedtuples. Unfortunately, this particular client could not use them because they needed the ability to default values to None. This forced us to move on to dataclasses. However, this particular client needed runtime data validation and so Pydantic was needed. We still did not meet the client's requirements because we modified the original class SomeResults. We then used Pydantic's dataclass decorator so that we did not have to modify the class SomeResults.

As a side note, if you are ever in a situation where you can't modify the code but at the same time you have to modify the code, think decorators.
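One way to picture this: a decorator is just a function, so it can be applied to a class after the fact, without editing the file where the class is defined. The sketch below uses the standard library dataclass decorator to stay dependency-free, and the SomeResults class in it is a stand-in for a class imported from code that may not be edited.

```python
from dataclasses import dataclass
from typing import List, Optional


# Stand-in for the client's class; pretend it lives in a module
# that we are not allowed to touch.
class SomeResults:
    topic_1: Optional[List[str]]
    topic_2: Optional[List[str]]
    topic_3: Optional[List[str]]


# Calling the decorator as an ordinary function adds __init__, __repr__
# and __eq__ to the class without changing its source definition.
SomeResults = dataclass(SomeResults)

result = SomeResults(topic_1=["topic_1x"], topic_2=None, topic_3=None)
print(result)
# SomeResults(topic_1=['topic_1x'], topic_2=None, topic_3=None)
```

The class object itself is still altered at runtime, so this only helps when the constraint is on the source file and not on the behaviour.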

I would like to end by reiterating that approach 1 should never be used. It is an anti-pattern.

Sunday, March 2, 2025

Identify Nouns & Verbs in Text Using Natural Language Toolkit (NLTK)

Introduction

One of the common tasks that I have to do is identify nouns and verbs in text. If I am doing object oriented programming, the nouns will help in deciding which classes to create. Even if I am not doing object oriented programming, having a list of nouns and verbs will facilitate understanding.

Python Advanced Concepts

Some of the Python code used might be considered "advanced".

For example, we will use a list of tuples. In my opinion, going from ["txt_1", "txt_2"] to [("txt_1a", "txt_1b"), ("txt_2a", "txt_2b")] is no big deal. Refer to output 2 for how it applies in this write-up.

Also, Counter() is used to count the number of times that a particular tuple occurs. Refer to output 3 for how it applies in this write-up.

A Counter is a subclass of dict. Here, each key is a tuple and each value is the number of times that tuple occurs. Refer to output 3 for how it applies in this write-up.

Natural Language Toolkit (NLTK)

NLTK is used for tokenization and then assigning parts of speech (POS) to those tokens. Unfortunately, NLTK is large and complex. The reference section at the end provides links and short text snippets for the applicable NLTK components.

Computing Environment

Python version 3.12.9 is used.

NLTK version 3.9.1 is used.

I used Anaconda to install NLTK. It did not download the NLTK data packages by default, because fetching everything with nltk.download() takes approximately 2.5 GB. Consequently, I had to download punkt_tab in order to get nltk.word_tokenize() to work.

>>> import nltk

>>> nltk.download('punkt_tab')

Similarly, I had to download averaged_perceptron_tagger_eng (see "Source code for nltk.tag.perceptron" in the references) in order to get nltk.pos_tag() to work.

>>> import nltk

>>> nltk.download('averaged_perceptron_tagger_eng')

Software Structure

To go to the file containing the working code click here.

The structure of the code is as follows

  1. Create text data

  2. Process the text to tokenize it, assign a POS to it, and then count the combination of tokens and POS.

  3. Filter data such that only nouns and verbs remain.

  4. Sort data such that the output has desired format.

          Create Text Data

          The test data is created as follows

          1. Code Snippet 1

            TEXT_1 = """
            This is chunk one of the text.
            
            The leaves on a tree are green.
            
            The leaves on a tree aren't red.
            """
            
            TEXT_2 = """
            This is chunk two of the text.
            
            The color of the car is red.
            
            The color of the car also contains blue.
            
            The color of the car isn't green.
            """
            
            LIST_OF_STRINGS = [TEXT_1, TEXT_2]
            
            

          TEXT_1 and TEXT_2 are easy to modify, so readers can replace them with their own test data. Also, it is easy to expand the data to use TEXT_3, TEXT_4 and so on because the processing is based on a list of text contained in LIST_OF_STRINGS.

          Process Text: tokenize, assign POS to tokens, count combinations of token and POS

          The processing of text is accomplished with the following code snippet.

          1. Code Snippet 2

            from collections import Counter

            import nltk

            # For each combination of a word and its part of speech,
            # count the number of times that it appears
            count_token_pos = Counter()
            for string in LIST_OF_STRINGS:

                # Decompose string into tokens
                string_tokenized = nltk.word_tokenize(string)

                # Assign part of speech to each token
                string_pos_tag = nltk.pos_tag(string_tokenized)

                # Add this chunk's (token, POS) pairs to the running count
                count_token_pos.update(string_pos_tag)
            
            

          Tokenize

          Tokenization is accomplished via nltk.word_tokenize().

            Its output is a list of strings.

            1. Output 1

              ['This', 'is', 'chunk', ... 'are', "n't", 'red', '.']
              
              

            Since the material is introductory, we will not worry about edge cases like contractions ("n't").

              Assign POS to Tokens

              Association of a POS with a token is accomplished via nltk.pos_tag().

                Its output is a list of tuples where each tuple consists of a token and a POS.

                1. Output 2

                  [('This', 'DT'), ('is', 'VBZ'), ('chunk', 'JJ'), ... ("n't", 'RB'), ('red', 'JJ'), ('.', '.')]
                  
                  

                'DT' stands for determiner. 'VBZ' stands for a verb that is third person singular present. For the full list, refer to Alphabetical list of part-of-speech tags used in the Penn Treebank Project at UPenn.Edu

                Count Combinations of Token and POS

                Counter() is used to count the number of times that each combination of token and POS occurs.

                  The output of Counter(string_pos_tag) is

                  1. Output 3 (Just for processing TEXT_1)

                    Counter(
                    
                        {
                    
                            ('.', '.'): 3, 
                    
                            ('The', 'DT'): 2, 
                    
                            ('leaves', 'NNS'): 2, 
                    
                            ...
                    
                            ('green', 'JJ'): 1, 
                    
                            ("n't", 'RB'): 1, 
                    
                            ('red', 'JJ'): 1
                    
                        }
                    
                    )
                    
                    

                  Notice that Counter() behaves like a dictionary. Each key is a tuple consisting of a token and a POS, as shown in output 2. Each value is an integer: the number of times that the tuple occurs.
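A small sketch of how Counter behaves with tuple keys. The tagged pairs below are typed in by hand for illustration and are not actual NLTK output; tagged_1, tagged_2 and counts are names made up here.

```python
from collections import Counter

# Hand-made (token, POS) pairs standing in for the output of nltk.pos_tag()
tagged_1 = [("The", "DT"), ("leaves", "NNS"), ("are", "VBP"), (".", ".")]
tagged_2 = [("The", "DT"), ("car", "NN"), ("is", "VBZ"), (".", ".")]

counts = Counter(tagged_1)
counts.update(tagged_2)  # update() adds to the existing counts

print(isinstance(counts, dict))  # True: Counter is a dict subclass
print(counts[("The", "DT")])     # 2
print(counts[("tree", "NN")])    # 0: a missing key counts as zero
```

Because update() adds counts, a single Counter can accumulate results across any number of text chunks.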

                  Filter Data for Nouns & Verbs

                  The first step is to identify the POS that correspond to nouns and verbs.

                    This is implemented via the following code snippet.

                    1. Code Snippet 3

                      NOUN_POS = ["NN", "NNS", "NNP", "NNPS"]
                      
                      VERB_POS = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
                      
                      

                    'NN' stands for a noun which is singular. 'NNS' stands for a noun which is plural. For the full list, refer to Alphabetical list of part-of-speech tags used in the Penn Treebank Project at UPenn.Edu

                    The filtering is implemented via the following code snippet.

                    1. Code Snippet 4

                      list_noun_count = []
                      list_verb_count = []
                      for token_pos in count_token_pos:
                          if token_pos[1] in NOUN_POS:
                              list_noun_count.append((token_pos[0], count_token_pos[token_pos]))
                          elif token_pos[1] in VERB_POS:
                              list_verb_count.append((token_pos[0], count_token_pos[token_pos]))
                      
                      

                    In the above code, there is nothing worthy of note. Just an if statement inside a for loop. If people need a refresher on iterating through a dictionary, consider reading "How to Iterate Through a Dictionary in Python" by Leodanis Pozo Ramos.
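The same filtering can also be written with list comprehensions. This is only a sketch of an alternative style, run on a small hand-made Counter so that it works without NLTK; the counts in it are illustrative.

```python
from collections import Counter

NOUN_POS = frozenset({"NN", "NNS", "NNP", "NNPS"})
VERB_POS = frozenset({"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"})

# Hand-made stand-in for count_token_pos
count_token_pos = Counter({
    ("leaves", "NNS"): 2,
    ("tree", "NN"): 2,
    ("are", "VBP"): 2,
    ("green", "JJ"): 1,
    (".", "."): 3,
})

# Unpack each key into (token, pos) while iterating over key/value pairs
list_noun_count = [
    (token, count)
    for (token, pos), count in count_token_pos.items()
    if pos in NOUN_POS
]
list_verb_count = [
    (token, count)
    for (token, pos), count in count_token_pos.items()
    if pos in VERB_POS
]

print(list_noun_count)  # [('leaves', 2), ('tree', 2)]
print(list_verb_count)  # [('are', 2)]
```

Using sets for NOUN_POS and VERB_POS is a small design choice: membership tests do not depend on the number of tags, although with only a handful of tags a list works just as well.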

                    The output of this step is provided below.

                    1. Output 4.A: List of nouns and their counts

                      [ ('text', 2), ('leaves', 2), ('tree', 2), ('color', 3), ('car', 3) ]
                      
                      
                    2. Output 4.B: List of verbs and their counts

                      [ ('is', 4), ('are', 2), ('contains', 1) ]
                      
                      

                    Notice that the output is a list of tuples. Also, each tuple consists of a string and an integer. This will be important when the lists are sorted.

                    Sort Data & Generate Output

                    One set of output will be nouns and their associated counts. This needs to be sorted either by the nouns alphabetically or by their counts. Otherwise, the data will be in some sort of random order. Similarly, another set of output will be verbs and their counts which will also need to be sorted.

                    This is implemented via the following code snippet.

                    1. Code Snippet 5

                      # Sort data alphabetically
                      list_noun_count.sort()
                      list_verb_count.sort()
                      
                      with open("noun_counts_alphabetized_by_noun.txt", "w", encoding="utf8") as file:
                          for noun_count in list_noun_count:
                              file.write(noun_count[0] + ", " + str(noun_count[1]) + "\n")
                      
                      with open("verb_counts_alphabetized_by_verb.txt", "w", encoding="utf8") as file:
                          for verb_count in list_verb_count:
                              file.write(verb_count[0] + ", " + str(verb_count[1]) + "\n")
                      
                      # Sort data by their counts, largest count first
                      list_noun_count.sort(key=lambda noun_count: noun_count[1], reverse=True)
                      list_verb_count.sort(key=lambda verb_count: verb_count[1], reverse=True)

                      with open("noun_counts_by_decreasing_count.txt", "w", encoding="utf8") as file:
                          for noun_count in list_noun_count:
                              file.write(noun_count[0] + ", " + str(noun_count[1]) + "\n")

                      with open("verb_counts_by_decreasing_count.txt", "w", encoding="utf8") as file:
                          for verb_count in list_verb_count:
                              file.write(verb_count[0] + ", " + str(verb_count[1]) + "\n")
                      
                      

                    The first thing to note is that list.sort() sorts a list in place. Consequently, we can just use list_noun_count.sort(). This is different from the built-in function sorted(), which returns a new sorted list.

                    The second thing to note is that a key is specified for sorting by count via lambda noun_count: noun_count[1]. Recall from Output 4.A and 4.B that we are sorting a list of tuples. If people need a refresher on sorting, consider reading "How to Use sorted() and .sort() in Python" by David Fundakowski.
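A brief sketch contrasting the two kinds of sort on the verb list from Output 4.B. It uses sorted() so that the original list is left alone, and a tuple key as one possible way to break ties alphabetically; by_name and by_count are names made up here.

```python
list_verb_count = [("is", 4), ("are", 2), ("contains", 1)]

# Tuples compare element by element, so the default sort is alphabetical
# by the verb.
by_name = sorted(list_verb_count)

# Negating the count sorts the largest first; the verb breaks ties.
by_count = sorted(list_verb_count, key=lambda verb_count: (-verb_count[1], verb_count[0]))

print(by_name)   # [('are', 2), ('contains', 1), ('is', 4)]
print(by_count)  # [('is', 4), ('are', 2), ('contains', 1)]

# sorted() returned new lists; the original order is unchanged.
print(list_verb_count[0])  # ('is', 4)
```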

                    Lastly, please make a habit of specifying utf8 as the encoding when creating a file. Unfortunately, Python uses the OS locale encoding by default [locale.getencoding()]. This is especially important when using the Windows operating system, where the default is commonly cp1252.
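A minimal sketch of why the explicit encoding matters. It writes one line containing a non-ASCII character to a temporary file and reads it back. The file name and the sample text are made up, and locale.getpreferredencoding(False) is used here because locale.getencoding() only exists in Python 3.11 and later.

```python
import locale
import tempfile
from pathlib import Path

text = "caf\u00e9, 1\n"  # "café" contains one non-ASCII character

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "noun_counts.txt"

    # Explicit encoding: the bytes on disk do not depend on the OS locale.
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)

    raw_bytes = path.read_bytes()

    with open(path, encoding="utf-8") as file:
        round_trip = file.read()

# What open() would have used if no encoding had been given
print(locale.getpreferredencoding(False))
print(raw_bytes)
print(round_trip == text)
```

In UTF-8 the accented character is stored as two bytes. Under cp1252 it would be a single, different byte, so a file written with one encoding and read with the other comes back garbled.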

                    The final output is provided below.

                    1. Output 5.A: Nouns sorted by count

                      car, 3
                      color, 3
                      leaves, 2
                      text, 2
                      tree, 2
                      
                      
                    2. Output 5.B: Verbs sorted by count

                      is, 4
                      are, 2
                      contains, 1
                      
                      

                    Alternative Approaches

                    An alternative to NLTK is spaCy. If you Google "Identify Nouns & Verbs in Text Using spaCy", you will obtain results which will help with starting the journey.

                    Summary

                    Determining nouns and verbs is simple. :-)

                    1. Tokenize using nltk.word_tokenize().

                    2. Associate a POS with each token using nltk.pos_tag().

                    3. Count the number of times that each combination of token and POS occurs by using Counter from Python's collections module.

                    4. Filter such that only noun and verb POS remain.

                    5. Sort the data as needed to generate the desired output.

                    However, as is commonly said, the devil is in the details. If you are interested in the details, check out the reference section below.

                    References

                    1. Articles
                      1. A Good Part-of-Speech Tagger in about 200 Lines of Python by Matthew Honnibal
                      2. Alphabetical list of part-of-speech tags used in the Penn Treebank Project at UPenn.Edu
                        1. Tag: NN
                          1. Description: Noun, singular or mass
                        2. ...
                      3. How to Iterate Through a Dictionary in Python by Leodanis Pozo Ramos
                      4. How to Use sorted() and .sort() in Python by David Fundakowski
                      5. NLTK Lookup Error at Stack Overflow
                        1. First answer said the missing module is 'the Perceptron Tagger', actually its name in nltk.download is 'averaged_perceptron_tagger'
                        2. You can use this to fix the error
                          1. nltk.download('averaged_perceptron_tagger')
                      6. What to download in order to make nltk.tokenize.word_tokenize work? at Stack Overflow
                        1. At home, I downloaded all nltk resources by nltk.download() but, as I found out, it takes ~2.5GB.
                        2. Try to download ... punkt_tab
                          1. import nltk
                          2. nltk.download('punkt_tab')

                    2. Books

                    3. Natural Language Toolkit (NLTK)
                      1. NLTK Documentation
                        1. All modules for which code is available (_modules)
                          1. nltk
                            1. Source code for nltk
                          2. nltk.tag
                            1. Source code for nltk
                          3. nltk.tag.perceptron
                            1. Source code for nltk.tag.perceptron

                              TAGGER_JSONS = {
                                  "eng": {
                                      "weights": "averaged_perceptron_tagger_eng.weights.json",
                                      "tagdict": "averaged_perceptron_tagger_eng.tagdict.json",
                                      "classes": "averaged_perceptron_tagger_eng.classes.json",
                                  },

                        2. NLTK API
                          1. Sub-Packages
                            1. nltk.downloader module
                              1. Introduction
                                1. The NLTK corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.
                              2. Downloading Packages
                                1. Individual packages can be downloaded by calling the download() function with a single argument, giving the package identifier for the package that should be downloaded:

                                  >>> download('treebank')
                                  [nltk_data] Downloading package 'treebank'...
                                  [nltk_data] Unzipping corpora/treebank.zip.

                              3. Downloader
                            2. nltk.tag package
                              1. Module Contents
                                1. NLTK Taggers
                                  1. This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.
                                  2. A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (tag, token). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):

                                    >>> tagged_tok = ('fly', 'NN')

                                  3. An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:

                                    >>> from nltk import pos_tag, word_tokenize
                                    >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
                                    [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

                              2. pos_tag()
                                1. NB. Use pos_tag_sents() for efficient tagging of more than one sentence.
                              3. pos_tag_sents()
                              4. Sub-Modules
                            3. nltk.tokenize package
                              1. Module Contents
                                1. NLTK Tokenizer Package
                                  1. Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:

                                    >>> from nltk.tokenize import word_tokenize
                                    >>> s = '''Good muffins cost $3.88\nin New York. Please buy me ... two of them.\n\nThanks.'''
                                    >>> word_tokenize(s)
                                    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

                                  2. This particular tokenizer requires the Punkt sentence tokenization models to be installed.
                              2. word_tokenize()
                              3. Sub-Modules
                                1. nltk.tokenize.punkt module
                                  1. Punkt Sentence Tokenizer
                                    1. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
                                    2. The NLTK data package includes a pre-trained Punkt tokenizer for English.
                                  2. save_punkt_params()

                    4. Python
                      1. Binary Data Services
                        1. codecs - Codec registry and base classes
                          1. Standard Encodings
                            1. Codec: cp1252 - Aliases: windows-1252 - Language: Western Europe
                            2. Codec: utf_8 - Aliases: U8, UTF, utf8, cp65001 - Language: all languages
                      2. Glossary
                        1. locale encoding
                          1. On Windows, it is the ANSI code page (ex: "cp1252").
                      3. Standard Library

                    5. spaCy

                                                                                            Tuesday, February 25, 2025

                                                                                            Synthetic Data Generation References

                                                                                             

                                                                                            Generating synthetic data has been a topic of interest since the early days of conventional databases. Recently, it has become a hot topic because of the large language model (LLM) phenomenon. Basically, everyone has scraped the internet and now more data is needed. Higher quality data is needed as well. For further details, refer to the video "Jonathan Ross, Founder & CEO @ Groq: NVIDIA vs Groq - The Future of Training vs Inference | E1260" at 20VC with Harry Stebbings. Start listening 2 minutes in and then just listen for 3 minutes.

                                                                                            This blog post contains a list of references which hopefully will save others some leg work.

                                                                                            Articles

                                                                                            1. Don't use diffprivlib by Ted

                                                                                              1. If you're looking for a Python library to perform differential privacy computations, diffprivlib seems to be an attractive choice. You'll find it prominently featured in Google search results. It's maintained by IBM, and extensively cited in the scientific literature. Its README states that you can use it to "build your own differential privacy applications", and it's regularly updated. Last but not least, it's very easy to pick up: its API mimics well-known tools like NumPy or scikit-learn, making it look simple and familiar to data scientists.

                                                                                              2. Unfortunately, diffprivlib is flawed in a number of important ways. I think most people should avoid using it. This blog post lists a few reasons why.

                                                                                            2. Synthetic Data by Robert Riemann at the European Data Protection Supervisor

                                                                                            Books

                                                                                            Vendors

                                                                                            Friday, August 16, 2024

                                                                                            Code Snippets Demonstrating Unpacking in Python

                                                                                            Introduction

                                                                                            Unpacking is the process of extracting values from a sequence/iterable and assigning them to multiple variables using only a single line of code.

                                                                                            It is assumed that the reader is familiar with Python lists, dictionaries, tuples and the built-in function zip.

                                                                                            A picture is worth a thousand words. In software, we have a similar concept: a carefully selected code snippet is worth a thousand words. This blog post will use the code snippet approach.

                                                                                            Not Using Asterisks


                                                                                            Example 1.A: multiple assignment using a list

                                                                                            >>> a, b, c = [1, 2, 3]
                                                                                            
                                                                                            >>> a
                                                                                            1
                                                                                            
                                                                                            >>> b
                                                                                            2
                                                                                            
                                                                                            >>> c
                                                                                            3
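
One caveat worth a quick sketch: the number of variables on the left must match the number of items on the right, otherwise Python raises a ValueError. The variable names below are illustrative only.

```python
# Multiple assignment needs exactly as many targets as there are items.
values = [1, 2, 3]

try:
    a, b = values  # three items but only two targets
except ValueError as error:
    message = str(error)

print(message)  # the message starts with "too many values to unpack"
```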
                                                                                            

                                                                                            Example 1.B: multiple assignment using any iterable

                                                                                            The previous example used a list. This example uses a string. It illustrates that unpacking works with any iterable.

                                                                                            Code

                                                                                            >>> d, e, f = '456'
                                                                                            
                                                                                            >>> d
                                                                                            '4'
                                                                                            
                                                                                            >>> e
                                                                                            '5'
                                                                                            
                                                                                            >>> f
                                                                                            '6'
                                                                                            

                                                                                            Example 1.C: multiple assignment using a dictionary


                                                                                            Example 1.C.1: just the dictionary keys

                                                                                            Code

                                                                                            >>> x, y = {'g': 7, 'h': 8}
                                                                                            
                                                                                            >>> x
                                                                                            'g'
                                                                                            
                                                                                            >>> y
                                                                                            'h'
                                                                                            

                                                                                            Notice that the above unpacking retrieves only the keys of the dictionary.
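
If the values are wanted instead of the keys, the values() method can be unpacked the same way. A minimal sketch, using the same dictionary contents as above with hypothetical variable names:

```python
letters = {'g': 7, 'h': 8}

# Iterating over a dictionary yields its keys ...
first_key, second_key = letters

# ... while values() yields the associated values, in insertion order.
first_value, second_value = letters.values()

print(first_key, second_key)      # g h
print(first_value, second_value)  # 7 8
```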

                                                                                            Example 1.C.2: the dictionary key and associated value together

                                                                                            For details on the items() method of a dictionary, see the Python documentation for dict.items().

                                                                                            Code

                                                                                            >>> d = {'a': 1, 'b': 2, 'c': 3}
                                                                                            
                                                                                            >>> x, y, z = d.items()
                                                                                            
                                                                                            >>> x
                                                                                            ('a', 1)
                                                                                            
                                                                                            >>> x[0]
                                                                                            'a'
                                                                                            
                                                                                            >>> x[1]
                                                                                            1
                                                                                            
                                                                                            >>> y
                                                                                            ('b', 2)
                                                                                            
                                                                                            >>> z
                                                                                            ('c', 3)
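
The same key/value unpacking is most often seen in a for loop, where each (key, value) tuple is unpacked as it is produced. A small sketch (the list name lines is made up for illustration):

```python
d = {'a': 1, 'b': 2, 'c': 3}

lines = []
for key, value in d.items():  # each item is a (key, value) tuple
    lines.append(f"{key} -> {value}")

print(lines)  # ['a -> 1', 'b -> 2', 'c -> 3']
```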
                                                                                            

                                                                                            Single Asterisk (*)


                                                                                            Example 2: single asterisk to unpack a list


                                                                                            Example 2.A: single asterisk to unpack a list into the [ last | first ] assignment variable


                                                                                            Example 2.A.1: single asterisk to unpack a list into the last assignment variable

                                                                                            >>> a, *b = [1, 2, 3]
                                                                                            
                                                                                            >>> a
                                                                                            1
                                                                                            
                                                                                            >>> b
                                                                                            [2, 3]
                                                                                            

                                                                                            Example 2.A.2: single asterisk to unpack a list into the first assignment variable

                                                                                            >>> *c, d = [7, 8, 9]
                                                                                            
                                                                                            >>> c
                                                                                            [7, 8]
                                                                                            
                                                                                            >>> d
                                                                                            9
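
Two details that the examples above imply but do not show: the starred variable always receives a list, whatever the type of the iterable being unpacked, and that list is empty when nothing is left over. A short sketch with illustrative names:

```python
# The starred target is a list even when the source is a string.
head, *tail = "abc"
print(head)  # a
print(tail)  # ['b', 'c']

# With nothing left over, the starred target is an empty list.
only, *leftover = [1]
print(only)      # 1
print(leftover)  # []
```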
                                                                                            

                                                                                            Example 2.B: single asterisk to unpack a list into the middle assignment variable


                                                                                            Example 2.B.1: single asterisk to unpack a list into the middle assignment variable - Simple

                                                                                            >>> e, *f, g = [10, 11, 12, 13]
                                                                                            
                                                                                            >>> e
                                                                                            10
                                                                                            
                                                                                            >>> f
                                                                                            [11, 12]
                                                                                            
                                                                                            >>> g
                                                                                            13
                                                                                            

                                                                                            Example 2.B.2: single asterisk to unpack a list into the middle assignment variable - Complex

                                                                                            Code

                                                                                            >>> h, i, *j, k, l = [14, 15, 16, 17, 18, 19]
                                                                                            
                                                                                            >>> h
                                                                                            14
                                                                                            
                                                                                            >>> i
                                                                                            15
                                                                                            
                                                                                            >>> j
                                                                                            [16, 17]
                                                                                            
                                                                                            >>> k
                                                                                            18
                                                                                            
                                                                                            >>> l
                                                                                            19
                                                                                            

                                                                                            Warning: putting too many variable initializations on the same line can cause readability issues. The above code snippet demonstrates this. One has to read the code carefully to determine what gets saved into "*j".

                                                                                            Example 2.C: single asterisk to unpack a list within a list


                                                                                            Example 2.C.1: single asterisk to unpack a list within a list - Once

                                                                                            >>> some_numbers = [1, 2, 3]
                                                                                            
                                                                                            >>> more_numbers = [*some_numbers, 4, 5]
                                                                                            
                                                                                            >>> more_numbers
                                                                                            [1, 2, 3, 4, 5]
                                                                                            

                                                                                            Example 2.C.2: single asterisk to unpack a list within a list - Multiple Times

                                                                                            >>> some_numbers = [1, 2, 3]
                                                                                            
                                                                                            >>> some_other_numbers = [4, 5, 6]
                                                                                            
                                                                                            >>> [*some_numbers, *some_other_numbers, 7, 8]
                                                                                            [1, 2, 3, 4, 5, 6, 7, 8]
                                                                                            

                                                                                            Example 2.D: combine built-in function zip with single asterisk unpacking


                                                                                            Example 2.D.1: extract elements of a list within a list into separate lists

                                                                                            >>> some_list_of_lists = [
                                                                                            ...     [1, 'a'],
                                                                                            ...     [2, 'b'],
                                                                                            ...     [3, 'c']
                                                                                            ...     ]
                                                                                            
                                                                                            >>> numbers, letters = zip(*some_list_of_lists)
                                                                                            
                                                                                            >>> numbers
                                                                                            (1, 2, 3)
                                                                                            
                                                                                            >>> letters
                                                                                            ('a', 'b', 'c')
                                                                                            

                                                                                            Example 2.D.2: transpose a list of lists

                                                                                            >>> some_list_of_lists = [
                                                                                            ...     [1, 2, 3],
                                                                                            ...     [4, 5, 6],
                                                                                            ...     [7, 8, 9]
                                                                                            ...     ]
                                                                                            
                                                                                            >>> list(zip(*some_list_of_lists))
                                                                                            [(1, 4, 7), (2, 5, 8), (3, 6, 9)]
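
Note that zip produces tuples, so the transposed rows above are tuples rather than lists. If lists are needed, one possible approach is a list comprehension (a sketch, not the only way to do it):

```python
some_list_of_lists = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

# Convert each tuple produced by zip back into a list.
transposed = [list(column) for column in zip(*some_list_of_lists)]

print(transposed)  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```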
                                                                                            

                                                                                            Example 3: single asterisk for unpacking a tuple into positional arguments of a function (*args)

                                                                                            >>> def print_d(a, b, c):
                                                                                            ...     print(f"a =  {a}")
                                                                                            ...     print(f"b =  {b}")
                                                                                            ...     print(f"c =  {c}")
                                                                                            ...
                                                                                            
                                                                                            >>> d = (1, 2, 3)
                                                                                            
                                                                                            >>> print_d(*d)
                                                                                            a =  1
                                                                                            b =  2
                                                                                            c =  3
                                                                                            

                                                                                            Example 4: single asterisk gives arbitrary number of positional arguments to a function (*args)

                                                                                            >>> def sum_arbitrary_number_of_arguments(*args):
                                                                                            ...     result = 0
                                                                                            ...     for arg in args:
                                                                                            ...         result+=arg
                                                                                            ...     return result
                                                                                            ...
                                                                                            
                                                                                            >>> sum_arbitrary_number_of_arguments(1, 2, 3)
                                                                                            6
                                                                                            
                                                                                            >>> sum_arbitrary_number_of_arguments(1, 2, 3, 4)
                                                                                            10
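
Examples 3 and 4 can be combined: a function that accepts *args can be called with an unpacked list. A minimal sketch, where the function name total is hypothetical:

```python
def total(*args):
    """Add up any number of positional arguments."""
    result = 0
    for arg in args:
        result += arg
    return result


numbers = [1, 2, 3, 4]

# The asterisk spreads the list into four separate positional arguments.
print(total(*numbers))  # 10
```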
                                                                                            

                                                                                            Double Asterisk (**)


                                                                                            Example 5: double asterisk to unpack dictionary into keyword arguments


                                                                                            Example 5.A: double asterisk to unpack dictionary into keyword arguments - Simple

                                                                                            >>> full_name = {'last_name': "Flintstone", 'first_name': "Fred", 'middle_name': "Bedrock"}
                                                                                            
                                                                                            >>> "{last_name}-{first_name}-{middle_name}".format(**full_name)
                                                                                            'Flintstone-Fred-Bedrock'
                                                                                            

                                                                                            Example 5.B: double asterisk to unpack dictionary into keyword arguments that match the function parameters (**kwargs)

                                                                                            >>> def print_d(a, b, c):
                                                                                            ...     print(f"a =  {a}")
                                                                                            ...     print(f"b =  {b}")
                                                                                            ...     print(f"c =  {c}")
                                                                                            ...
                                                                                            
                                                                                            >>> d = {'a': 1, 'b': 2, 'c': 3}
                                                                                            
                                                                                            >>> print_d(**d)
                                                                                            a =  1
                                                                                            b =  2
                                                                                            c =  3
                                                                                            

                                                                                            Example 6: merging dictionaries (double asterisk used multiple times)


                                                                                            Example 6.A: double asterisk for unpacking dictionaries in order to merge them

                                                                                            >>> d_1 = {'a': 1, 'b': 2}
                                                                                            
                                                                                            >>> d_2 = {'c': 3, 'd': 4 }
                                                                                            
                                                                                            >>> merged_dict = {**d_1, **d_2}
                                                                                            
                                                                                            >>> merged_dict
                                                                                            {'a': 1, 'b': 2, 'c': 3, 'd': 4}
                                                                                            

                                                                                            Example 6.B: double asterisk for unpacking dictionaries in order to merge them and override as needed

                                                                                            >>> d_1 = {'a': 1, 'b': 2}
                                                                                            
                                                                                            >>> d_2 = {'b': 99, 'c': 3, 'd': 4 }
                                                                                            
                                                                                            >>> merged_dict = {**d_1, **d_2}
                                                                                            
                                                                                            >>> merged_dict
                                                                                            {'a': 1, 'b': 99, 'c': 3, 'd': 4}
                                                                                            

                                                                                            Notice how the final value of b is 99 even though its original value was 2.
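
The order of the unpacked dictionaries matters: for a duplicated key, the rightmost dictionary wins. Reversing the order, as in the sketch below, keeps the original value of b.

```python
d_1 = {'a': 1, 'b': 2}
d_2 = {'b': 99, 'c': 3, 'd': 4}

# d_1 is unpacked last, so its value for 'b' overrides the one from d_2.
merged_reversed = {**d_2, **d_1}

print(merged_reversed)  # {'b': 2, 'c': 3, 'd': 4, 'a': 1}
```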

                                                                                            Note

                                                                                            The above is used to demonstrate dictionary unpacking. The more Pythonic way to merge dictionaries is to use the vertical bar (|) operator, which is available in Python 3.9 and later.

                                                                                            merged_dict = d_1 | d_2
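
A slightly fuller sketch of the | operator, which assumes Python 3.9 or later. It also shows the in-place variant |=.

```python
d_1 = {'a': 1, 'b': 2}
d_2 = {'b': 99, 'c': 3, 'd': 4}

# | builds a new dictionary; the right-hand operand wins on duplicate keys.
merged_dict = d_1 | d_2
print(merged_dict)  # {'a': 1, 'b': 99, 'c': 3, 'd': 4}

# |= updates the left-hand dictionary in place.
d_1 |= d_2
print(d_1)  # {'a': 1, 'b': 99, 'c': 3, 'd': 4}
```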

                                                                                            Example 7: double asterisk gives an arbitrary number of keyword arguments to a function (**kwargs)

                                                                                            >>> def specialized_print(**kwargs):
                                                                                            ...     for x,y in kwargs.items():
                                                                                            ...         print(f'{x} has a value of {y}')
                                                                                            ...
                                                                                            
                                                                                            >>> specialized_print(a=1, b=2, c=3)
                                                                                            a has a value of 1
                                                                                            b has a value of 2
                                                                                            c has a value of 3
                                                                                            
                                                                                            >>> specialized_print(a=1, b=2, c=3, d=4)
                                                                                            a has a value of 1
                                                                                            b has a value of 2
                                                                                            c has a value of 3
                                                                                            d has a value of 4
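
Inside the function, kwargs is an ordinary dictionary, so the double asterisk works in both directions: it collects keyword arguments in the function definition and it unpacks a dictionary at the call site. The sketch below uses a hypothetical helper named describe, which returns the lines instead of printing them so the result is easy to inspect.

```python
def describe(**kwargs):
    """Return one line per keyword argument (hypothetical helper)."""
    return [f"{name} has a value of {value}" for name, value in kwargs.items()]

# A dictionary can be unpacked into keyword arguments at the call site,
# and mixed with keyword arguments written out by hand.
settings = {"a": 1, "b": 2}
lines = describe(**settings, c=3)

for line in lines:
    print(line)
# a has a value of 1
# b has a value of 2
# c has a value of 3

print(describe())   # []  (no keyword arguments gives an empty dictionary)
```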
                                                                                            

                                                                                            Slicing

                                                                                            Another way to achieve the same outcomes as unpacking is to use indexing and slicing. Let's redo "Example 2.B.2" this way.

                                                                                            Code

                                                                                            >>> some_list = [14, 15, 16, 17, 18, 19]
                                                                                            
                                                                                            >>> # get first value
                                                                                            >>> some_list[0]
                                                                                            14
                                                                                            
                                                                                            >>> # get second value
                                                                                            >>> some_list[1]
                                                                                            15
                                                                                            
                                                                                            >>> # get from third value to the end but leave out the last two
                                                                                            >>> some_list[2:-2]
                                                                                            [16, 17]
                                                                                            
                                                                                            >>> # get second to last value
                                                                                            >>> some_list[-2]
                                                                                            18
                                                                                            
                                                                                            >>> # get last value
                                                                                            >>> some_list[-1]
                                                                                            19
                                                                                            

                                                                                            Notice how much more verbose the slicing syntax is. In addition, each index and slice boundary has to be worked out by hand.

                                                                                            However, comparing unpacking and slicing is not very helpful, because which one to use depends on the use case. The comparison is included for completeness, because people often ask about slicing when the topic of unpacking comes up.
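
To see the two approaches side by side, here is a small sketch. It assumes the unpacking version has the form first, second, *middle, second_to_last, last; those variable names are chosen here for illustration.

```python
some_list = [14, 15, 16, 17, 18, 19]

# Unpacking: a single statement names every piece.
first, second, *middle, second_to_last, last = some_list

# Indexing and slicing: one expression per piece.
by_index = (
    some_list[0],
    some_list[1],
    some_list[2:-2],
    some_list[-2],
    some_list[-1],
)

print(first, second, middle, second_to_last, last)   # 14 15 [16, 17] 18 19
print(by_index == (first, second, middle, second_to_last, last))   # True
```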

                                                                                            Summary

                                                                                            Besides assigning variables, unpacking lets you specify the size and shape you expect the iterable to have.
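
A short sketch of what that means in practice: when the iterable does not match the targets, unpacking raises a ValueError, whereas slicing quietly returns whatever is there. The helper split_pair is hypothetical and exists only for this illustration.

```python
def split_pair(pair):
    """Unpack exactly two values (hypothetical helper for illustration)."""
    x, y = pair  # raises ValueError unless pair yields exactly two items
    return x, y

print(split_pair([1, 2]))   # (1, 2)

try:
    split_pair([1, 2, 3])
    rejected = False
except ValueError:
    rejected = True
print(rejected)             # True

# Nested targets describe a nested shape: a pair followed by a single value.
(a, b), c = [(1, 2), 3]
print(a, b, c)              # 1 2 3

# Slicing makes no such check: a too-short list just yields fewer items.
print([1][0:2])             # [1]
```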