Sunday, October 21, 2018

Python: Iterators - Yield Statement - Generators - Comprehensions

The topics of iterators, yield statement, generators and comprehensions are of interest to anyone that uses for loops, nested for loops and is concerned about the compactness of their code and it's associated performance.

The purpose of this blog post is to provide a single spot where an introduction to iterators, the yield statement, generators and comprehensions can be found. The introductions will be made by first starting with a code snippet and then providing an explanation. For those that are interested in the corresponding Python documentation, a reference section is provided at the end.

Below is a pictorial of our journey.



All code below was executed using Python 3.7.0.

Add Sequence Numbers to a List


We are going to begin by adding a sequence number to the items in a list



>>> seasons = ['Spring', 'Summer', 'Fall', 'Winter']

>>> list(enumerate(seasons,start=10))
[(10, 'Spring'), (11, 'Summer'), (12, 'Fall'), (13, 'Winter')]

Create Enumerate Equivalent Function Using Yield


An equivalent function for enumerate is provided below.
def enumerate_function( sequence, start = 0 ):
    n = start
    for element in sequence:
        yield n, element
        n += 1
Next, let's exercise the above function.








seasons = [ 'Spring', 'Summer', 'Fall', 'Winter' ]


for sequence_number, season in enumerate_function( seasons, start = 20 ):
    print("The season ", season, " has been assigned a sequence number of ", sequence_number)
The generated output is:










The season  Spring  has been assigned a sequence number of  20
The season  Summer  has been assigned a sequence number of  21
The season  Fall  has been assigned a sequence number of  22
The season  Winter  has been assigned a sequence number of  23
Notice the yield statement in enumerate_function.

Let's examine the execution of the enumerate_function to determine what the yield statement is doing. The execution starts when the function is called and proceeds to the first yield statement. At that point in time, execution is suspended and the associated values are returned. During suspension, the full state of the function is retained. When the function is invoked once again, the execution of the function resumes immediately after the yield statement.

The above seems awfully complicated. You stop the execution of a function, you save its entire state, you return a value and you resume processing after the function is invoked once again. The cost of this added complexity is worth it because it reduces the amount of required memory. If the values are not generated as they are needed, they would have to be all created at the beginning and stored in memory. For large data sets, a large amount of memory would be required.

If a function has a yield statement, it is called a "generator function". It is important to differentiate between a "generator function" and a "normal function". In the former, the function can be executed many times; while the latter is executed only once.


For Illustrative Purposes, Replace For Loop with While Loop


The phrase "When the function is invoked once again" needs to be further explained. As usual, we are going to start with a code snippet. Specifically, we are going to replace the for loop in the enumerate_function with an infinite while loop helped by iter() and next().


Below is the updated code snippet.







def enumerate_function_rev_1( sequence, start = 0 ):
    n = start
    iterator = iter(sequence)
    while True:
        try:
            yield n, next(iterator)
        except StopIteration:
            break
        n += 1
Also, as usual, we will explain the concepts after providing a code snippet. The iter() function returns an iterator object. The next() function retrieves the next item from the iterator. The infinite while loop continues to retrieve data until the StopIteration exception is raised.

We originally added complexity to a for loop by introducing the yield statement. Now we have added more complexity by adding the concept of an iterator with its supporting functions. We are doing this for illustrative purposes only.

Here, we are trying to illustrate that under the hood, the for loop uses an iterator. Also, it provided an opportunity to introduce the concept of an iterator. The remainder of the blog post will expand on this concept. Please note that you should continue to use the for loop and not replace it with an infinite while loop.

By the way, the use of iterator with its supporting functions is commonly referred to as the "iterator protocol".

Iterator Performance Benefit: Less Memory Consumed


Why bother with the concept of the iterator protocol? The one word answer: performance. Please note that the preceding code snippets were only used to demonstrate concepts and were not intended to address performance concerns.

Below is a code snippet to generate 10 million numbers. When dealing with big data, this is not an unreasonable number.

import sys

a_list = [1 for i in range(10000000) ]
a_generator = (1 for i in range(10000000) )

sys.getsizeof(a_list)      # 81,528,056 Bytes
sys.getsizeof(a_generator) #        120 Bytes

The list requires 78 MB while the generator only requires 120 bytes. This is 679,400 times as much. Allocating over 600,000 times as much memory and doing this many times during processing will quickly impact performance.

The following is a quote from the Python official documenation: The superior memory performance is kept by processing elements one at a time rather than bringing the whole iterable into memory all at once. Code volume is kept small by linking the tools together in a functional style which helps eliminate temporary variables. High speed is retained by preferring “vectorized” building blocks over the use of for-loops and generators which incur interpreter overhead.

The idea that code volume is kept small by linking the tools together in a functional style will be demonstrated below with the dot product computation. Vectorization is beyond the scope of this blog. However, if this is of interest to you, a good place to start is "Look Ma, No For-Loops: Array Programming With NumPy" by Brad Solomon.

Use Dot Product Computation to Demonstrate Iterator Protocol


We can use the computation of the dot product to demonstrate the above. If you have list/vector a = [1, 3, -5] and list/vector b = [4, -2, -1], the dot product is "(1 * 4) + (3 * -2) + (-5 * -1)" which is equal to 3.

Below is a code snippet to implement the dot product





import operator
a = [1,  3, -5]
b = [4, -2, -1]

sum( map( operator.mul, a, b ) )
The key to understanding the above code snippet is map(function, iterable, ...). In this case, the function being applied is multiplication via operator.mul(a, b). Iterable 1 is list a. Iterable 2 is list b. Notice how the concept of an iterator allows us to write one line of easy to understand code to compute the dot product. For the other benefits, please re-read the paragraph that starts with "Why bother with the concept of the iterator protocol?".

For those familiar with Excel, the above corresponds to the SUMPRODUCT function.

List Comprehension


No conversation of this type would be complete without mentioning list comprehensions because they also use the concept of an iterator. As an example, let's create a list of the square of the numbers 0 through 9 inclusive.

>>> [x**2 for x in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
A more complex example of a list comprehension is to combine the elements of two lists only if they are not equal.






>>> [(x, y) for x in [1,2,3] for y in [3,1,4] if x != y]
[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
Notice how (1,1) and (3,3) are not in the results. This arises from "if x != y".

There are also set and dictionary comprehensions. These two comprehensions are not conceptually that different from list comprehensions. Consequently, nothing more will be said about them in this blog post.

The thing to notice as you read the Python documentation is that a comprehension is a flavor of the special syntax called "displays". The following is a quote from the documentation:
    1. either the container contents are listed explicitly, or
      1. they are computed via a set of looping and filtering instructions, called a comprehension.

    Summary


    In summary, we started out adding a sequence number to a list by simply using Python's built-in enumerate function. We then explicitly implemented our own version of the enumerate function so that we could introduce the yield statement. The yield statement in turn led to the concept of the generator function.

    Also, just like we decomposed the built in function enumerate, we also decomposed the for loop. We replaced the for loop with an infinite while loop helped by iter() and next(). This allowed us to introduce the concept of the iterator object. We then discussed the performance benefits of using iterator objects which was demonstrated by computing the dot product.

    We closed by talking about list, set and dictionary comprehensions.