Friday, March 24, 2017

Algorithmic study of finding the sum of any 2 integers from a list being equal to a specific value

I attended the "CLRS textbook study" meeting on 3/20/2017 of the "Algorithms and data structure problems" Pittsburgh group. This was the second meeting of the group. In the first meeting, it was decided that algorithms would be studied using the standard book CLRS. Unfortunately, CLRS does not seem appropriate given the audience. Below is my guess of what may be appropriate for the group.
    We begin by explicitly stating what appears to be a simple problem: given a list of integers, find 2 distinct integers that sum to a specific value. For example, given the data set [1, 2, 3, 4, 5, 6], which 2 distinct integers sum to 11? Mathematically, this can be stated as follows:
    1. x + y = x0, where
    2. x ∈ N and y ∈ N for some finite set of integer values N
    3. x0 is the integer to which x and y sum
    4. x ≠ y

    Concepts are demonstrated via code snippets. The code snippets were executed using Python 2.7.6. They were written in such a way that anyone with basic coding skills can read them. In other words, the goal was not to be Pythonic or to use Python's standard library to solve the problem. Let me re-iterate: the goal of the code snippets is to demonstrate concepts.

          Simplest Case

          The integer list is sorted and the quantity of integers is small enough such that brute force is acceptable.

          Code Snippet 1
          
          
          SomeIntegerList = [1, 2, 3, 4, 5, 6]
          DesiredSumOfIntegers = 11
          for SomeIntegerA in SomeIntegerList:
              for SomeIntegerB in SomeIntegerList:
                  if SomeIntegerA == SomeIntegerB:
                      continue
                  SumOfIntegers = SomeIntegerA + SomeIntegerB
                  print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
                  if DesiredSumOfIntegers == SumOfIntegers:
                      print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"
          
          
          Output of Code Snippet

          SomeInteger A =  1 , SomeInteger B =  2 , Sum of Integers =  3
          SomeInteger A =  1 , SomeInteger B =  3 , Sum of Integers =  4
          SomeInteger A =  1 , SomeInteger B =  4 , Sum of Integers =  5
          SomeInteger A =  1 , SomeInteger B =  5 , Sum of Integers =  6
          SomeInteger A =  1 , SomeInteger B =  6 , Sum of Integers =  7
          SomeInteger A =  2 , SomeInteger B =  1 , Sum of Integers =  3
          SomeInteger A =  2 , SomeInteger B =  3 , Sum of Integers =  5
          SomeInteger A =  2 , SomeInteger B =  4 , Sum of Integers =  6
          SomeInteger A =  2 , SomeInteger B =  5 , Sum of Integers =  7
          SomeInteger A =  2 , SomeInteger B =  6 , Sum of Integers =  8
          SomeInteger A =  3 , SomeInteger B =  1 , Sum of Integers =  4
          SomeInteger A =  3 , SomeInteger B =  2 , Sum of Integers =  5
          SomeInteger A =  3 , SomeInteger B =  4 , Sum of Integers =  7
          SomeInteger A =  3 , SomeInteger B =  5 , Sum of Integers =  8
          SomeInteger A =  3 , SomeInteger B =  6 , Sum of Integers =  9
          SomeInteger A =  4 , SomeInteger B =  1 , Sum of Integers =  5
          SomeInteger A =  4 , SomeInteger B =  2 , Sum of Integers =  6
          SomeInteger A =  4 , SomeInteger B =  3 , Sum of Integers =  7
          SomeInteger A =  4 , SomeInteger B =  5 , Sum of Integers =  9
          SomeInteger A =  4 , SomeInteger B =  6 , Sum of Integers =  10
          SomeInteger A =  5 , SomeInteger B =  1 , Sum of Integers =  6
          SomeInteger A =  5 , SomeInteger B =  2 , Sum of Integers =  7
          SomeInteger A =  5 , SomeInteger B =  3 , Sum of Integers =  8
          SomeInteger A =  5 , SomeInteger B =  4 , Sum of Integers =  9
          
          SomeInteger A =  5 , SomeInteger B =  6 , Sum of Integers =  11
          DesiredSumOfIntegers =  11  was found
          
          SomeInteger A =  6 , SomeInteger B =  1 , Sum of Integers =  7
          SomeInteger A =  6 , SomeInteger B =  2 , Sum of Integers =  8
          SomeInteger A =  6 , SomeInteger B =  3 , Sum of Integers =  9
          SomeInteger A =  6 , SomeInteger B =  4 , Sum of Integers =  10
          
          SomeInteger A =  6 , SomeInteger B =  5 , Sum of Integers =  11
          DesiredSumOfIntegers =  11  was found
          
          You will notice that both "1 + 2" and "2 + 1" equal 3. This corresponds to the commutative property of addition. If we create a matrix of all the possible addition permutations, we get a symmetric matrix.


          Now, let's talk about all the possible addition permutations. Here I am using the term permutation as it is strictly defined in mathematics. For a discussion of the difference between permutations and combinations, refer to Combinations and Permutations at MathIsFun.Com. The above example corresponds to permutations without repetition, which means that we should use the equation below

          n! / (n - r)!

          Here n is the number of integers in the list and r = 2 because we are taking 2 numbers at a time, so the count reduces to n(n - 1). Below is a table which evaluates the above equation. It shows how quickly the number of required additions increases as the number of integers in the list increases.
            Number of Integers  Number of
            in the List         Additions
            ------------------  ---------
                2                     2
                3                     6
                4                    12
                5                    20
                6                    30
               10                    90
              100                 9,900
              150                22,350
              500               249,500
            1,000               999,000
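            The table can be reproduced with a short function (the function name is ours; note this sketch uses Python 3 print syntax, unlike the Python 2 snippets in this post):

```python
def additions_required(n, r=2):
    """Permutations without repetition: n! / (n - r)!.
    For r = 2 this reduces to n * (n - 1)."""
    result = 1
    for k in range(n - r + 1, n + 1):
        result *= k
    return result

for n in [2, 3, 4, 5, 6, 10, 100, 150, 500, 1000]:
    print(n, additions_required(n))
```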
            
            
            Another perspective is to note that the nested for loops result in O(n²) behavior. Clearly, the above approach is not scalable.

            Reduce Computations by Taking Advantage of Symmetry

            Code Snippet 2
            
            
            SomeIntegerList = [1, 2, 3, 4, 5, 6]
            DesiredSumOfIntegers = 11
            i = 0
            for SomeIntegerA in SomeIntegerList:
                i = i + 1
                j = 0
                for SomeIntegerB in SomeIntegerList:
                    j = j + 1
                    if j > i:
                        print "i = ", i, ", j = ", j
                        SumOfIntegers = SomeIntegerA + SomeIntegerB
                        print "SomeInteger A = ", SomeIntegerA, ", SomeInteger B = ", SomeIntegerB, ", Sum of Integers = ", SumOfIntegers
                        if DesiredSumOfIntegers == SumOfIntegers:
                            print "DesiredSumOfIntegers = ", DesiredSumOfIntegers, " was found"
            
            
            Output of Code Snippet
            
            
            i =  1 , j =  2
            SomeInteger A =  1 , SomeInteger B =  2 , Sum of Integers =  3
            i =  1 , j =  3
            SomeInteger A =  1 , SomeInteger B =  3 , Sum of Integers =  4
            i =  1 , j =  4
            SomeInteger A =  1 , SomeInteger B =  4 , Sum of Integers =  5
            i =  1 , j =  5
            SomeInteger A =  1 , SomeInteger B =  5 , Sum of Integers =  6
            i =  1 , j =  6
            SomeInteger A =  1 , SomeInteger B =  6 , Sum of Integers =  7
            i =  2 , j =  3
            SomeInteger A =  2 , SomeInteger B =  3 , Sum of Integers =  5
            i =  2 , j =  4
            SomeInteger A =  2 , SomeInteger B =  4 , Sum of Integers =  6
            i =  2 , j =  5
            SomeInteger A =  2 , SomeInteger B =  5 , Sum of Integers =  7
            i =  2 , j =  6
            SomeInteger A =  2 , SomeInteger B =  6 , Sum of Integers =  8
            i =  3 , j =  4
            SomeInteger A =  3 , SomeInteger B =  4 , Sum of Integers =  7
            i =  3 , j =  5
            SomeInteger A =  3 , SomeInteger B =  5 , Sum of Integers =  8
            i =  3 , j =  6
            SomeInteger A =  3 , SomeInteger B =  6 , Sum of Integers =  9
            i =  4 , j =  5
            SomeInteger A =  4 , SomeInteger B =  5 , Sum of Integers =  9
            i =  4 , j =  6
            SomeInteger A =  4 , SomeInteger B =  6 , Sum of Integers =  10
            i =  5 , j =  6
            SomeInteger A =  5 , SomeInteger B =  6 , Sum of Integers =  11
            DesiredSumOfIntegers =  11  was found
            
            Now, let's talk about the number of computations involved in using nested for loops while taking advantage of the symmetry. The number of additions drops from n(n - 1) to n(n - 1)/2. Even though the number of computations is cut in half, the approach is still O(n²) and therefore still not scalable.
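            To see the halving concretely, here is a minimal sketch (Python 3; the helper names are ours) comparing ordered pairs against unordered pairs:

```python
def ordered_pairs(n):
    """All addition permutations without repetition: n * (n - 1)."""
    return n * (n - 1)

def unique_pairs(n):
    """Combinations without repetition: C(n, 2) = n * (n - 1) / 2."""
    return n * (n - 1) // 2

print(ordered_pairs(1000))  # 999000
print(unique_pairs(1000))   # 499500 -- half the work, but still quadratic in n
```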

            Alternative Approach: Search for the other integer

            Since the above approaches are not scalable, we need to step back and try something different. In this case, we can take advantage of the fact that the list of integers is sorted. Since we have an integer value and we know what the sum must be, we can compute the integer value that we need to find. We can then do a search over an ordered list for the integer that we need to find. For the search, we will use the binary search from Problem Solving with Algorithms and Data Structures Using Python by Brad Miller and David Ranum.

            Code Snippet 3
            
            
            
            
            def binarySearch(alist, item):
                first = 0
                last = len(alist) - 1
                found = False
                while first <= last and not found:
                    midpoint = (first + last) // 2
                    if alist[midpoint] == item:
                        found = True
                    else:
                        if item < alist[midpoint]:
                            last = midpoint - 1
                        else:
                            first = midpoint + 1
                return found
            SomeIntegerList = [1, 2, 3, 4, 5, 6]
            DesiredSumOfIntegers = 11
            for SomeInteger in SomeIntegerList:
                IntegerOfInterest = DesiredSumOfIntegers - SomeInteger
                if IntegerOfInterest == SomeInteger:
                    print "IntegerOfInterest was not found"
                    break
                print "SomeInteger = ", SomeInteger, ", IntegerOfInterest = ", IntegerOfInterest, ", DesiredSumOfIntegers = ", DesiredSumOfIntegers
                if binarySearch(SomeIntegerList,IntegerOfInterest) == True:
                    print "IntegerOfInterest was found"
                    break
                else:
                    print "IntegerOfInterest was not found"
                print "---"
            
            
            Output of Code Snippet

            SomeInteger =  1 , IntegerOfInterest =  10 , DesiredSumOfIntegers =  11
            IntegerOfInterest was not found
            ---
            SomeInteger =  2 , IntegerOfInterest =  9 , DesiredSumOfIntegers =  11
            IntegerOfInterest was not found
            ---
            SomeInteger =  3 , IntegerOfInterest =  8 , DesiredSumOfIntegers =  11
            IntegerOfInterest was not found
            ---
            SomeInteger =  4 , IntegerOfInterest =  7 , DesiredSumOfIntegers =  11
            IntegerOfInterest was not found
            ---
            SomeInteger =  5 , IntegerOfInterest =  6 , DesiredSumOfIntegers =  11
            IntegerOfInterest was found
            
            
            Let's analyze the code snippet.
            1. Since the integers that sum must be distinct, the diagonal of the matrix has values of N/A. This resulted in the following if statement

               if IntegerOfInterest == SomeInteger:
                   print "IntegerOfInterest was not found"
                   break

            2. Secondly, we should remove the integer that we are on from the binary search. However, removing an item from a list is expensive. For the details, refer to How to remove an element from a list by index in Python? at StackOverFlow.Com. Consequently, we elected not to remove the integer that we are on from the list. Also, having 1 extra element in the search list has little impact when there are a large number of integers.

            Now, let's talk about the number of computations involved. It is commonly known that a binary search is O(log n). Since we have to do the binary search once for each integer, we have to multiply by n. This results in a performance of O(n log n). For the rationale of why the base of the log does not matter, refer to Wikipedia: Sorting algorithm.

              Complicate Problem by Having Integer List Not Sorted

              This complication is easily addressed by just sorting the integer list. It is commonly known that comparison sorts cannot perform better than O(n log n) in the average or worst case. For background information on this, refer to Wikipedia: Sorting algorithm. Consequently, the computations required to determine if 2 integers in an unsorted integer list sum to a particular value are
              1. O(n log n) for sorting the unordered integer list
              2. O(n log n) for searching for the other integer
              The total computational complexity is O(2 n log n) = O(n log n).
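              The sort-then-search approach can be sketched as follows (Python 3; this uses the standard library's bisect module in place of the hand-written binarySearch, and the function name is ours):

```python
from bisect import bisect_left

def has_pair_with_sum(integers, target):
    """Sort the list (O(n log n)), then for each element binary-search
    for its complement (n searches at O(log n) each)."""
    ordered = sorted(integers)
    for i, value in enumerate(ordered):
        complement = target - value
        j = bisect_left(ordered, complement)
        # the complement must exist and be a distinct element of the list
        if j < len(ordered) and ordered[j] == complement and j != i:
            return True
    return False

print(has_pair_with_sum([6, 1, 4, 3, 2, 5], 11))  # True: 5 + 6
print(has_pair_with_sum([6, 1, 4, 3, 2, 5], 12))  # False: would need two 6's
```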

                  Above material corresponds to exercise 2.3-7 of the book Introduction to Algorithms by Cormen, ...

                  Some people will recognize that the above material corresponds to exercise 2.3-7 of the book Introduction to Algorithms by Cormen, Leiserson, Rivest, Stein (CLRS) [Edition 3]. Below is the actual exercise


                  Note that Zhao HG on GitBook demonstrates that the problem of finding two elements that sum to a given value in a sorted list of n integers can be solved in O(n) time as follows:
                  def two_sum(a, x):
                      l, r = 0, len(a)-1
                      while l < r:
                          if a[l] + a[r] == x:
                              return True
                          elif a[l] + a[r] < x:
                              l += 1
                          else:
                              r -= 1
                      return False
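                  Assuming the definition above, a quick sanity check of the two-pointer routine against our running example (Python 3 print syntax):

```python
def two_sum(a, x):
    # Zhao HG's two-pointer scan, repeated so this example is self-contained
    l, r = 0, len(a) - 1
    while l < r:
        if a[l] + a[r] == x:
            return True
        elif a[l] + a[r] < x:
            l += 1
        else:
            r -= 1
    return False

print(two_sum([1, 2, 3, 4, 5, 6], 11))  # True: 5 + 6
print(two_sum([1, 2, 3, 4, 5, 6], 12))  # False
```

                  Because one pointer only ever moves right and the other only ever moves left, each element is examined at most once, which is why the scan is O(n).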
                  
                  
                  Summary

                  The above material demonstrates a common process. You start out with what appears to be a simple problem. You try a brute force approach but quickly realize that it won't scale. You then look for patterns like symmetry to reduce the number of required computations. If you are lucky, this will be good enough. If not, an alternative approach will have to be tried.

                  Wednesday, February 8, 2017

                  Work Around Transaction Support in an RDBMS for Bulk Loads - Just Say No

                  Recently, some of my contacts have been struggling with bulk loading into conventional RDBMS products like Oracle, SQL Server and MySQL. They were surprised by the issue.

                  My reply to them is that this is a standard issue. RDBMS products by their nature guarantee transactions. This is very expensive.

                  They do a lot of logging to guarantee transactions. For example, a standard technique in Oracle to add a column to a large table is to create a copy of the table with the additional column using NOLOGGING. Afterwards you add back in constraints, indexes, … .

                  The standard joke in Oracle is that it first does redo and then an undo and then actually does something. For those familiar with Oracle knobs, undo has significance.

                  Also, there is a lot of context switching going on between the RDBMS and the external interface. Heck, even within Oracle, PL/SQL implemented BULK COLLECT to reduce all this context switching between PL/SQL and SQL. (Bulk Processing with BULK COLLECT and FORALL by Steven Feuerstein)

                  So, the trick is to identify the knobs that minimize logging and context switching.

                  However, if you want to be a true heretic, consider asking the question of why use an RDBMS if you don’t need transaction support?

                  For further details on this type of stuff, check out the following blog posts

                  1. SQL Can Be Slow -- Why Do People Doubt This?
                  2. Data Warehousing and SQL -- Tread Carefully by S. Lott
                  3. NoSQL Database Doesn’t Mean No Schema by Steven F. Lott

                  Thursday, December 22, 2016

                  Jumping into Spark (JIS): Python / Spark / Logistic Regression (Update 3)

                  In this blog we will use the Python interface to Spark to determine whether or not someone makes more or less than $50,000. The logistic regression model will be used to make this determination.

                  Please note that no prior knowledge of Python, Spark or logistic regression is required. However, it is assumed that you are a seasoned software developer. 

                  There are various tutorials on Spark. There is the official documentation on Spark. However, if you are an experienced software professional and want to just jump in and kick the tires, there doesn't seem to be much available. Well, at least I couldn't find any.

                  Yes, there is Spark's Quick Start. Also, there are artifacts like databricks User Guide. Unfortunately, they have a smattering of stuff. You really don't get a chance to jump in.

                  Let me now explicitly define jumping in. Jumping in involves solving an almost trivial problem that demonstrates a good deal of the concepts and software. Also, it involves skipping steps. Links will be provided so that if someone is interested in the missing steps, they can look up the details. If skipping steps bothers you, immediately stop and go read a tutorial.

                  As mentioned earlier, we are going to use Python on Spark to address logistic regression for our jumping in. This is our almost trivial problem that demonstrates a good deal of the concepts and software.

                  The first thing that you need is a working environment. You could install and configure your own environment. However, that would not be in line with jumping in. Instead I recommend using databricks Spark community edition (DSCE). If you are using DSCE, refer to "Welcome to Databricks" on how to create a cluster, create a notebook, attach the notebook to a cluster and actually use the notebook.

                  Next, you need to connect to the Spark environment. In the database world, you create a database connection. In the Spark world, you create a context (SparkContext/SparkSession/SQLContext). If you are using DSCE, the following three variables will be predefined for you
                  1. SparkContext: sc
                  2. SparkSession: spark
                  3. SQLContext: sqlContext
                      If you would like the IPython notebook associated with this blog, click here. If for some reason you don't have software to read the IPython notebook, you can download a pdf version of it by clicking here.


                      Row / Column


                      We are almost ready to code. First we have to talk about Row and DataFrame. A Row is just like a row in a spreadsheet or a row in a table. A DataFrame is just a collection of Rows. For the most part, you can think of a DataFrame as a table.

                      Now for the first code snippet. Please note that I took the code snippet from Encode and assemble multiple features in PySpark at StackOverFlow.Com.

                      Code Snippet 1
                      
                      
                      from pyspark.sql import Row
                      from pyspark.mllib.linalg import DenseVector
                      
                      row = Row("gender", "foo", "bar")
                      
                      dataFrame = sc.parallelize([
                        row("0",  3.0, DenseVector([0, 2.1, 1.0])),
                        row("1",  1.0, DenseVector([0, 1.1, 1.0])),
                        row("1", -1.0, DenseVector([0, 3.4, 0.0])),
                        row("0", -3.0, DenseVector([0, 4.1, 0.0]))
                      ]).toDF()
                      
                      dataFrame.collect()

                      Nothing fancy. You create a Row template which has the column names gender, foo and bar. You then create a bunch of Rows with actual data. Lastly, you group the Rows into a DataFrame. DenseVector was used to demonstrate that a cell in a Row can have a complex data structure. If you are curious about parallelize and toDF, check the references at the end of the blog. This will be true for the rest of the blog. If you are not sure what some magic word means, go to the reference section at the end of the blog.

                      If things are working, you should get an output like that shown below.

                      Output of Code Snippet 1

                      [Row(gender=u'0', foo=3.0, bar=DenseVector([0.0, 2.1, 1.0])),
                       Row(gender=u'1', foo=1.0, bar=DenseVector([0.0, 1.1, 1.0])),
                       Row(gender=u'1', foo=-1.0, bar=DenseVector([0.0, 3.4, 0.0])),
                      Row(gender=u'0', foo=-3.0, bar=DenseVector([0.0, 4.1, 0.0]))]

                      Before going further, we need to take a little detour. Algorithms like numeric data. So we need to do things like convert a column for gender that has Male / Female / Unknown to binary values. We do this by creating two columns.
                      1. The first column will be used to indicate whether the person is or is not male. A one means that the person is male and a zero indicates that the person is not male. StringIndexer will be used to convert categories to numerical values.
                      2. The second column will be used to indicate whether the person is or is not female. A one means that the person is female and a zero indicates that the person is not female.
                      Notice that we don't need a third column for Unknown. If there is a zero in both the Male and Female columns, we know that the gender is Unknown. This process of taking a single category column and decomposing it to many columns of binary values will be accomplished by the OneHotEncoder.
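                      The gender example can be sketched in plain Python (Python 3; the column names and the dictionary representation are our own illustration, not Spark's API):

```python
def one_hot_gender(gender):
    """Decompose a single gender category into two binary columns.
    Unknown is the dropped category: both columns are zero."""
    return {"is_male": 1 if gender == "Male" else 0,
            "is_female": 1 if gender == "Female" else 0}

for g in ["Male", "Female", "Unknown"]:
    print(g, one_hot_gender(g))
```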

                       

                      StringIndexer


                      A StringIndexer converts categories to numbers. The numbers have a range from 0 to number of categories minus one. The most frequent category gets a number of zero, the second most frequent category gets a number of 1 and so on. We are going to use the code snippets from Preserve index-string correspondence spark string indexer from StackOverFlow.Com to demonstrate what the preceding English means.

                      Let's actually create a StringIndexer and use it to map/fit/transform categories to numbers.

                      Code Snippet 2

                      dataFrame = sqlContext.createDataFrame(
                          [(0, "a"), (1, "b"), (2, "b"), (3, "c"), (4, "c"), (5, "c"), (6,'d'), (7,'d'), (8,'d'), (9,'d')],
                          ["id", "category"])
                      
                      dataFrame.collect()
                      
                      from pyspark.ml.feature import StringIndexer
                      stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
                      modelStringIndexer = stringIndexer.fit(dataFrame)
                      transformedDataFrame = modelStringIndexer.transform(dataFrame)
                      transformedDataFrame.collect()
                      

                      Output of Code Snippet 2

                      [Row(id=0, category=u'a', categoryIndex=3.0),
                       Row(id=1, category=u'b', categoryIndex=2.0),
                       Row(id=2, category=u'b', categoryIndex=2.0),
                       Row(id=3, category=u'c', categoryIndex=1.0),
                       Row(id=4, category=u'c', categoryIndex=1.0),
                       Row(id=5, category=u'c', categoryIndex=1.0),
                       Row(id=6, category=u'd', categoryIndex=0.0),
                       Row(id=7, category=u'd', categoryIndex=0.0),
                       Row(id=8, category=u'd', categoryIndex=0.0),
                       Row(id=9, category=u'd', categoryIndex=0.0)]
                      

                      Notice how d's got 0.0 because they are the most numerous. The letter c's got 1.0 because they are the second most numerous. And so on. The code snippet below will make this more clear.

                      Code Snippet 3

                      transformedDataFrame.select('category','categoryIndex').distinct().orderBy('categoryIndex').show()
                      

                      Output of Code Snippet 3

                      +--------+-------------+
                      |category|categoryIndex|
                      +--------+-------------+
                      |       d|          0.0|
                      |       c|          1.0|
                      |       b|          2.0|
                      |       a|          3.0|
                      +--------+-------------+

                      OneHotEncoder


                      A OneHotEncoder converts category numbers to binary vectors with at most a single one-value per row. For a true understanding of one-hot encoding, refer to the associated Wikipedia page.

                      Next, let's use a OneHotEncoder to transform the category index that we created earlier into a binary vector.

                      Code Snippet 4

                      from pyspark.ml.feature import OneHotEncoder
                      oneHotEncoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVector")
                      oneHotEncodedDataFrame = oneHotEncoder.transform(transformedDataFrame)
                      oneHotEncodedDataFrame.show()
                      

                      Output of Code Snippet 4

                      +---+--------+-------------+--------------+
                      | id|category|categoryIndex|categoryVector|
                      +---+--------+-------------+--------------+
                      |  0|       a|          3.0|     (3,[],[])|
                      |  1|       b|          2.0| (3,[2],[1.0])|
                      |  2|       b|          2.0| (3,[2],[1.0])|
                      |  3|       c|          1.0| (3,[1],[1.0])|
                      |  4|       c|          1.0| (3,[1],[1.0])|
                      |  5|       c|          1.0| (3,[1],[1.0])|
                      |  6|       d|          0.0| (3,[0],[1.0])|
                      |  7|       d|          0.0| (3,[0],[1.0])|
                      |  8|       d|          0.0| (3,[0],[1.0])|
                      |  9|       d|          0.0| (3,[0],[1.0])|
                      +---+--------+-------------+--------------+

                      SparseVector

                      The column categoryVector is a SparseVector. It has 3 parts. The first part is the length of the vector. The second part is the indices which contain values. The third part is the actual values. Below is a code snippet demonstrating this.

                      Code Snippet 5

                      from pyspark.mllib.linalg import SparseVector
                      v1 = SparseVector(5, [0,3], [10,9])
                      for x in v1:
                        print(x)
                      

                      Output of Code Snippet 5

                      10.0
                      0.0
                      0.0
                      9.0
                      0.0
                      

                      Notice how category a (categoryVector = (3,[],[])) is encoded as the all-zero vector. The last category is dropped because including it would make the vector entries always sum to one and hence be linearly dependent. The code snippet below will provide a better visual for this.

                      Code Snippet 6

                      oneHotEncodedDataFrame.select('category','categoryIndex', 'categoryVector').distinct().orderBy('categoryIndex').show()
                      

                      Output of Code Snippet 6

                      +--------+-------------+--------------+
                      |category|categoryIndex|categoryVector|
                      +--------+-------------+--------------+
                      |       d|          0.0| (3,[0],[1.0])|
                      |       c|          1.0| (3,[1],[1.0])|
                      |       b|          2.0| (3,[2],[1.0])|
                      |       a|          3.0|     (3,[],[])|
                      +--------+-------------+--------------+
                      

                      VectorAssembler

                      By this point you are probably getting impatient. Luckily, we have just one more item to cover before we get to logistic regression. That one item is the VectorAssembler. A VectorAssembler just concatenates columns together. As usual, we will demonstrate what the words mean via a code snippet.

                      Code Snippet 7

                      from pyspark.ml.feature import VectorAssembler
                      dataFrame_1 = spark.createDataFrame([(1, 2, 3), (4,5,6)], ["a", "b", "c"])
                      vectorAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
                      dataFrame_2 = vectorAssembler.transform(dataFrame_1)
                      dataFrame_2.show()

                      Output of Code Snippet 7

                      +---+---+---+-------------+
                      |  a|  b|  c|     features|
                      +---+---+---+-------------+
                      |  1|  2|  3|[1.0,2.0,3.0]|
                      |  4|  5|  6|[4.0,5.0,6.0]|
                      +---+---+---+-------------+

                      Logistic Regression


                      We have now learned enough Spark to look at a specific problem involving logistic regression. We are going to work through the example provided in the databricks documentation.

                      If you are interested in learning about logistic regression, I recommend reading "Chapter 6: The Granddaddy of Supervised Artificial Intelligence - Regression (spreadsheet)" of the book Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman (2013). Personally, I like the book because it provides minimal math and an implementation of logistic regression in a spreadsheet.

                      Drop / Create Table
                      1. Drop Table
                        %sql 
                        DROP TABLE IF EXISTS adult
                        
                        2. Create Table
                          %sql
                          
                          CREATE TABLE adult (
                            age               DOUBLE,
                            workclass         STRING,
                            fnlwgt            DOUBLE,
                            education         STRING,
                            education_num     DOUBLE,
                            marital_status    STRING,
                            occupation        STRING,
                            relationship      STRING,
                            race              STRING,
                            sex               STRING,
                            capital_gain      DOUBLE,
                            capital_loss      DOUBLE,
                            hours_per_week    DOUBLE,
                            native_country    STRING,
                            income            STRING)
                          USING com.databricks.spark.csv
                          OPTIONS (path "/databricks-datasets/adult/adult.data", header "true")
                          
                          
                        Convert table to a DataFrame
                        dataset = spark.table("adult")
                        Get a list of columns in original dataset
                        cols = dataset.columns
                        1. Capturing the column list has to be done here and not later: the databricks example reuses the variable dataset, so the original columns would otherwise be lost.
                        Perform One Hot Encoding on columns of interest
                        from pyspark.ml import Pipeline
                        from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
                        
                        categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
                        stages = [] # stages in our Pipeline
                        for categoricalCol in categoricalColumns:
                          stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
                          encoder = OneHotEncoder(inputCol=categoricalCol+"Index", outputCol=categoricalCol+"classVec")
                          # Add stages.  These are not run here, but will run all at once later on.
                          stages += [stringIndexer, encoder]
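
                        To see what these two stages accomplish, here is a rough pure-Python sketch (not Spark's actual implementation): StringIndexer maps each distinct category to an integer index, most frequent first, and OneHotEncoder turns that index into a 0/1 indicator vector. Note that Spark's OneHotEncoder drops the last category by default; this sketch keeps all categories for simplicity, and the workclass values are just sample strings.

```python
from collections import Counter

def string_index(values):
    # Map each distinct category to an integer index, most frequent
    # category first (Spark's StringIndexer default ordering).
    ordered = [category for category, _ in Counter(values).most_common()]
    return {category: index for index, category in enumerate(ordered)}

def one_hot(index, size):
    # Turn an integer index into a 0/1 indicator vector of length size.
    vector = [0] * size
    vector[index] = 1
    return vector

workclass = ["Private", "Private", "State-gov", "Private", "Self-emp"]
mapping = string_index(workclass)
print(mapping["Private"])  # 0, because it is the most frequent category
print(one_hot(mapping["State-gov"], len(mapping)))
```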
                        
                        
                        Create a StringIndexer on income
                        
                        
                        label_stringIdx = StringIndexer(inputCol = "income", outputCol = "label")
                        stages += [label_stringIdx]
                        
                        
                        Combine feature columns into a single vector column using VectorAssembler
                        
                        
                        numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
                        assemblerInputs = map(lambda c: c + "classVec", categoricalColumns) + numericCols
                        assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
                        stages += [assembler]
                        Put data through all of the feature transformations using the stages in the pipeline
                        pipeline = Pipeline(stages=stages)
                        pipelineModel = pipeline.fit(dataset)
                        dataset = pipelineModel.transform(dataset)
                        
                        
                        Keep relevant columns
                        
                        selectedcols = ["label", "features"] + cols
                        dataset = dataset.select(selectedcols)
                        display(dataset)
                        
                        
                        Split data into training and test sets
                        
                        
                        (trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed = 100)
                        print trainingData.count()
                        print testData.count()
                        1. Set seed for reproducibility
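                        randomSplit assigns each row by an independent random draw, so the 70/30 proportions are approximate rather than exact; the seed is what makes the split repeatable. A rough pure-Python analogue of that behavior (an illustration, not Spark's implementation):

```python
import random

def random_split(rows, weight, seed):
    # Each row is assigned independently: roughly `weight` of the rows
    # land in the first split. A fixed seed makes the split repeatable.
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < weight else second).append(row)
    return first, second

train, test = random_split(list(range(1000)), 0.7, seed=100)
print(len(train), len(test))  # roughly 700 and 300; identical on every run with seed=100
```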
                        Train logistic regression model and then make predictions
                        from pyspark.ml.classification import LogisticRegression
                        lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)
                        lrModel = lr.fit(trainingData)
                        predictions = lrModel.transform(testData)
                        selected = predictions.select("prediction", "age", "occupation")
                        display(selected)
                        
                        1. Output
                          prediction  age occupation
                          ----------  --- --------------
                          ...
                          0           20  Prof-specialty
                          1           35  Prof-specialty
                          ...
                          
                          
                          1. A prediction of 0 means that the person earns <=50K, while a prediction of 1 means that the person earns >50K.
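
                          To quantify how well the model does, you would compare predictions against the true labels. Spark ships evaluators for this (for example BinaryClassificationEvaluator), but the simplest metric, accuracy, is easy to sketch in pure Python. The (prediction, label) pairs below are hypothetical, for illustration only:

```python
def accuracy(pairs):
    # Fraction of (prediction, label) pairs that agree.
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return float(correct) / len(pairs)

# Hypothetical (prediction, label) pairs; 1 means >50K, 0 means <=50K.
pairs = [(0, 0), (1, 1), (0, 1), (1, 1), (0, 0)]
print(accuracy(pairs))  # 0.8
```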

                          Summary

                          In summary, we jumped in by using Python on Spark to address logistic regression. We did not read any tutorials. We did skip steps. However, links are provided in the reference section to fill in the steps if the reader so desires.

                          The advantage of jumping in is that you learn by solving a problem. Also, you don't spend weeks or months working through material like that listed in the references below before you can do anything. The disadvantage is that steps are skipped and you won't fully understand even the steps that are provided. Admittedly, jumping in is not for everybody; for some people, standard tutorials are the way to begin.

                           

                          References

                          1. Book: Data Smart by John W. Foreman
                            1. john-foreman.com
                            2. O'Reilly
                            3. Chapter 6: The Granddaddy of Supervised Artificial Intelligence - Regression
                              1. Text
                              2. Spreadsheet
                          2. Cartoon from New Yorker Magazine: Does your car have any idea why my car pulled it over?
                          3. databricks
                          4. sklearn.linear_model.LogisticRegression
                          5. Spark.Apache.Org
                          6. StackOverFlow.Com
                          7. Wikipedia

                          Thursday, September 29, 2016

                          Can a Purely State Based Database Schema Migration Be Trusted? by Robert Lucente

                          A really good article on the tradeoffs between state based vs migration based schema transformation is titled Critiquing two different approaches to delivering databases: Migrations vs state by Alex Yates on 6/18/2015. The blog Database Schema Migration by S. Lott on 9/27/2016 ends with "It's all procedural migration. I'm not [sure] declarative ("state") tools can be trusted beyond discerning the changes and suggesting a possible migration."

                          Let me start by examining the statement of the problem: state based vs migration based schema transformation. The problem statement implies that the solution is an either-or proposition. Why not both?

                          As opposed to speaking in generalities, let me pick a specific problem that is often encountered in real systems and that demonstrates the core issue. Also, let me pick a particular tool set to execute on this particular problem.

                          The specific problem involves
                          1. Adding some domain table and its associated data.
                          2. Adding a not null / mandatory column to some table that already has data.
                          3. Creating a foreign key between the domain table and the new mandatory column.
                          Below is a picture for the above words. The "stuff" in color are the tables and columns being added.


                                In a migration based schema approach the following steps would have to be performed
                                1. SomeTable exists with data.
                                2. A new domain table (SomeDomain) gets created.
                                3. The new domain table (SomeDomain) gets populated with data.
                                4. A new nullable / not mandatory column SomeDomainUuid is added to SomeTable.
                                5. Column SomeDomainUuid in SomeTable gets populated with data.
                                6. Column SomeDomainUuid in SomeTable is made not null / mandatory.
                                7. A foreign key is created between the two tables.
                                Notice that the above is very labor intensive and involves 7 steps. Software should be able to figure out all the steps and their sequences except for the following
                                1. The new domain table (SomeDomain) gets populated with data.
                                2. Column SomeDomainUuid in SomeTable gets populated with data.
                                The key thing to notice is that the steps that can't be automated involve data. There is no way for the software to know about the specifics of the data.
                                                    Now that we have defined a specific problem, let's execute on solving the problem using a specific tool set. I am going to use the Visual Studio SQL Server Database Project concept with SQL Server as the target database. Via a series of clicks, the end state of the database is specified in the "Visual Studio SQL Server Database Project".
                                                      Next we write a script (Populate_dbo_SomeDomain.sql) to populate SomeDomain table with data.
                                                      
                                                      
                                                      INSERT INTO dbo.SomeDomain
                                                          VALUES 
                                                          (newid(), 'Fred'), (newid(), 'Barney');
                                                      
                                                      
                                                      
                                                      The second step is to write a script (Update_dbo_Deal_AdvertiserTypeUuid.sql) to populate SomeDomainUuid column in SomeTable.
                                                      
                                                      
                                                      UPDATE [dbo].[SomeTable] 
                                                      SET    [SomeDomainUuid] = (SELECT SomeDomainUuid
                                                                                FROM   [dbo].[SomeDomain] 
                                                                                WHERE  Name = 'Fred');
                                                      
                                                      
                                                      The last preparation step is to write a script (Script.PostDeployment.sql) to run the above 2 scripts after the state based change has happened.
                                                      
                                                      
                                                      :r ".\Populate_dbo_SomeDomain.sql"
                                                      go
                                                      
                                                      :r ".\Update_dbo_Deal_AdvertiserTypeUuid.sql"
                                                      go
                                                      
                                                      
                                                      
                                                      Now that the desired end state has been specified as well as the data manipulation scripts written, it is time to modify a database. The Microsoft terminology for this is "publishing the database project". There will be an issue because the state changes will be made and then the Script.PostDeployment.sql script will be run. In between the state changes and the script being run, there will need to be data in the SomeDomainUuid column in the table SomeTable. This issue is addressed by using the GenerateSmartDefaults option when publishing the database project.

                                                      Let's summarize what this combination of state based and migration based schema transformation has allowed us to do. We were able to reduce 7 steps down to 2, and those 2 steps couldn't be automated anyway because they involve data. These are the pros. The con is that you have to be familiar with the framework and select the GenerateSmartDefaults option out of the 60 plus available options.

                                                      In conclusion, a purely state based approach can't work because of data. There is no way for the software to know how to do data migrations. Only humans know the correct way to do data migrations. In our example, there is no way for the software to know whether the new column SomeDomainUuid is to be initially populated with "Fred" or "Barney". This is a long-winded and polite way of saying that a purely state based database schema migration can't be trusted. However, the combination of state based and migration based can truly improve productivity.

                                                      Saturday, February 27, 2016

                                                      Gentle Introduction to Various Math Stuff

                                                      I recently attended a meetup in Pittsburgh titled Analytics of Social Progress: When Machine Learning meets Human Problems given by Amy Hepner. She did an outstanding job of introducing some math concepts very simply and intuitively. If you are interested in the slides showing how she did this, you can go to her web site by clicking here.

                                                      This got me thinking about how I could help others with simple and intuitive ways to explain math stuff. I get annoyed when people say "doubly differentiable" as opposed to no gaps and no kinks. What makes this difficult is that everyone is at a different level and so you won't be able to please most people. However, I figure, any help is better than no help at all.

                                                      As expected, there is already plenty of material on the internet. For statistics, a good place to start is "The Most Comprehensive Review of Comic Books Teaching Statistics" by Rasmus Baath. A second good place to look is "A Brief Review of All Comic Books Teaching Statistics" by Rasmus Baath and Christian Robert. For links to the books themselves, see the list below.
                                                      1. The Cartoon ...
                                                        1. The Cartoon Guide to Statistics by Larry Gonick, Woollcott Smith
                                                        2. The Cartoon Introduction to Statistics by Grady Klein, Alan Dabney
                                                      2. Manga Guide to Statistics by Shin Takahashi
                                                      3. ... for Dummies
                                                        1. Biostatistics For Dummies by John Pezzullo
                                                        2. Business Statistics For Dummies by Alan Anderson
                                                        3. Predictive Analytics For Dummies by Anasse Bari
                                                        4. Probability For Dummies by Deborah J. Rumsey
                                                        5. Psychology Statistics For Dummies by Donncha Hanna
                                                        6. Statistical Analysis with Excel For Dummies by Joseph Schmuller
                                                        7. Statistics Essentials For Dummies by Deborah J. Rumsey
                                                        8. Statistics for Big Data For Dummies by Alan Anderson
                                                        9. Statistics For Dummies by Deborah J. Rumsey
                                                        10. Statistics II For Dummies by Deborah J. Rumsey
                                                        11. Statistics Workbook For Dummies by Deborah J. Rumsey
                                                        12. Statistics: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Consumer Dummies
                                                      For gentle introductions to other math stuff, you can check out the list below. People have complained that the list below is too long. My response is if you are not willing to spend 10 minutes to skim through the list, you are not ready to make the commitment to upgrade your math skills.
                                                      1. Algebra ...
                                                        1. Algebra I For Dummies by Mary Jane Sterling
                                                        2. Algebra I Essentials For Dummies by Mary Jane Sterling
                                                        3. Algebra I Workbook For Dummies by Mary Jane Sterling
                                                        4. Algebra II For Dummies by Mary Jane Sterling
                                                        5. Algebra II Workbook For Dummies by Mary Jane Sterling
                                                        6. Algebra II: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Mary Jane Sterling
                                                      2. Basic Math and Pre-Algebra ...
                                                        1. Basic Math and Pre-Algebra For Dummies by Mark Zegarelli
                                                        2. Basic Math and Pre-Algebra: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Mark Zegarelli
                                                      3. Calculus ...
                                                        1. Calculus For Dummies by Mark Ryan
                                                        2. Calculus II For Dummies by Mark Zegarelli
                                                        3. Calculus Essentials For Dummies by Mark Ryan
                                                        4. Calculus Workbook For Dummies by Mark Ryan
                                                        5. Calculus: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Patrick Jones
                                                      4. Complete Idiot's Guide to Algebra Word Problems by Izolda Fotiyeva
                                                      5. Cartoon Guide ...
                                                        1. Cartoon Guide to Calculus by Larry Gonick
                                                        2. Cartoon Guide to Physics by Larry Gonick
                                                      6. Data ...
                                                        1. Data Mining For Dummies by Meta S. Brown
                                                        2. Data Science For Dummies by Lillian Pierson
                                                        3. Data Smart: Using Data Science to Transform Information into Insight by John W. Foreman
                                                      7. Differential Equations ...
                                                        1. Differential Equations For Dummies by Steven Holzner
                                                        2. Differential Equations Workbook For Dummies by Steven Holzner
                                                      8. Excel Data Analysis For Dummies by Stephen L. Nelson
                                                      9. Geometry ...
                                                        1. Geometry Essentials For Dummies by Mark Ryan
                                                        2. Geometry For Dummies by Mark Ryan
                                                        3. Geometry Workbook For Dummies by Mark Ryan
                                                        4. Geometry: 1,001 Practice Problems For Dummies (+ Free Online Practice) by Allen Ma
                                                      10. How to Solve Word Problems in Algebra by Mildred Johnson
                                                      11. Linear Algebra For Dummies by Mary Jane Sterling
                                                      12. Manga Guide to ...
                                                        1. Manga Guide to Calculus by Hiroyuki Kojima
                                                        2. Manga Guide to Linear Algebra by Shin Takahashi
                                                        3. Manga Guide to Physics by Hideo Nitta
                                                        4. Manga Guide to Regression Analysis by Shin Takahashi
                                                        5. Manga Guide to Relativity by Hideo Nitta
                                                      13. Math Word Problems ...
                                                        1. Math Word Problems Demystified by Allan Bluman
                                                        2. Math Word Problems For Dummies by Mary Jane Sterling
                                                      14. Optimization Modeling with Spreadsheets by Kenneth R. Baker
                                                      15. Physics ...
                                                        1. Physics I For Dummies by Steven Holzner
                                                        2. Physics I Workbook For Dummies by Steven Holzner
                                                        3. Physics II For Dummies by Steven Holzner
                                                      16. Pre-Calculus ...
                                                        1. Pre-Calculus For Dummies by Yang Kuang
                                                        2. Pre-Calculus Workbook For Dummies by Yang Kuang
                                                        3. Pre-Calculus: 1,001 Practice Problems For Dummies (+ Free Online… by Mary Jane Sterling
                                                      17. Predictive Analytics For Dummies by Anasse Bari
                                                      18. R For Dummies by Andrie de Vries
                                                      19. Schaum's ...
                                                        1. Schaum's Outline of Introduction to Probability and Statistics by Seymour Lipschutz
                                                        2. Schaum's Outline of Probability by Seymour Lipschutz
                                                        3. Schaum's Outline of Probability and Statistics: 760 Solved Problems + 20 Videos by John Schiller
                                                        4. Schaum's Outline of Statistics by Murray Spiegel
                                                      20. Technical Math For Dummies by Barry Schoenborn
                                                      21. Trigonometry ...
                                                        1. Trigonometry For Dummies by Mary Jane Sterling
                                                        2. Trigonometry Workbook For Dummies by Mary Jane Sterling

                                                      Friday, August 7, 2015

                                                      Sunday, June 7, 2015

                                                      Quantifying the Efficacy of Machine Learning Algorithms

                                                      There are many articles, books, talks, and so on, all making claims about machine learning algorithms. Out of those, how many actually QUANTIFY THE EFFICACY OF THE ALGORITHM?

                                                      Below are some references which will hopefully save people some leg work and help others quantify the performance of their algorithms.
                                                      1. Articles
                                                      2. Books
                                                        1. Assessing and Improving Prediction and Classification by Timothy Masters
                                                        2. Evaluating and Comparing the Performance of Machine Learning Algorithms by Melanie Mitchell
                                                        3. Evaluating Learning Algorithms: A Classification Perspective by Japkowicz, Shah
                                                          1. Amazon
                                                            1. Customer Reviews
                                                              1. Howard B. Bandy on July 7, 2014
                                                                1. ... examples presented are all of stationary data ...
                                                          2. Cambridge.Org
                                                            1. Looking for an examination copy?
                                                              1. If you are interested in the title for your course we can consider offering an examination copy. To register your interest please contact collegesales@cambridge.org providing details of the course you are teaching.
                                                          3. Google
                                                            1. Has table of contents and you can look at some randomly selected pages
                                                          4. MohakShah.Com
                                                            1. Email Address: eval@mohakshah.com
                                                            2. For electronic editions, ...
                                                            3. Computing Resources
                                                        4. Evaluation and Analysis of Supervised Learning Algorithms and Classifiers by Niklas Lavesson
                                                        5. Machine Learning and Data Mining: 14 Evaluation and Credibility by Pier Luca Lanzi
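
                                                      As a concrete starting point for quantifying efficacy, two of the most common classifier metrics, precision and recall, take only a few lines of Python to compute. The predictions and labels below are hypothetical, for illustration only:

```python
def confusion_counts(predictions, labels):
    # Count true positives, false positives, and false negatives
    # for a binary classifier (positive class = 1).
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    return tp, fp, fn

def precision_recall(predictions, labels):
    # precision = tp / (tp + fp); recall = tp / (tp + fn)
    tp, fp, fn = confusion_counts(predictions, labels)
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    return precision, recall

predictions = [1, 0, 1, 1, 0]
labels      = [1, 0, 0, 1, 1]
print(precision_recall(predictions, labels))  # (0.666..., 0.666...)
```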