Introduction
Once upon a time, Smith had to obtain new transportation. He first had to decide whether he was going to buy a car or a truck. Once he decided to buy a car, he had to choose among the various brands of cars, such as Ford and GM. Luckily, once Smith decided that he wanted a car, Mary (Smith's wife) decided to help with the research. Consequently, the Ford and GM car models could be researched concurrently.
Generalized Problem Statement
The above story can be used to create a generalized problem statement. For example, Smith has to decide whether he wants a car or a truck before he conducts any other research. This can be generalized to say that all categories in level 1 (car, truck, ...) must be processed before any level 2 categories (Ford, GM, ...) can be addressed.
We can continue with this generalization by saying that all level 2 categories can be processed concurrently. Remember that Smith can research Ford cars while Mary researches GM cars. In other words, the research of Ford and GM cars can happen concurrently.
The above has created a hierarchy of categories 2 levels deep. There are already many good examples on the internet that demonstrate parallelizing a hierarchy 2 levels deep. One example is "How to reduce the time complexity of nested loops" by Leandro Proença. So, we are going to add a third level to the hierarchy to make the problem more realistic. The additional twist is that all level 3 categories associated with a level 2 category can be processed in parallel once that level 2 category has been processed. Luckily for us, we can assume that processing each category takes approximately the same amount of time.
Below is a pictorial of the category hierarchy that will be used in the proof-of-concept code (input_data.py). For a PDF version of the data, click here. For a spreadsheet version of the data, click here.
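For those who prefer code to pictures, the shape of input_data.py can be sketched as a list of level 1 categories plus two lookup dictionaries. The names below match the ones used in the code later in this post; the values are placeholders rather than the actual hierarchy.

# input_data.py (sketch) -- placeholder values only; the real file contains
# the full hierarchy shown in the picture above.

# Level 1 categories.
categories_lvl_1 = ["car", "truck"]

# Level 1 category -> its level 2 categories.
categories_lvl_1_2 = {
    "car": ["Ford car", "GM car"],
    "truck": ["Ford truck"],
}

# Level 2 category -> its level 3 categories.
categories_lvl_2_3 = {
    "Ford car": ["Ford car sedan", "Ford car SUV"],
    "GM car": ["GM car sedan"],
}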
Once all the categories in all the hierarchies are processed, summary counts are computed.
The above problem is a little complicated. This is done on purpose. There are already many good examples on the internet that demonstrate parallelizing simple problems. My goal in this blog is to introduce some complexity in order to reflect a more real world problem.
For those who prefer to go straight to the code, click here. For those who prefer a description of the problem as well as the architecture used to solve it, please continue reading.
Nested For-Loops
For those who prefer to go straight to the code for nested for-loops, click here. For those who prefer a description of the how and why of the code, please continue reading.
One of the simplest things to do is to go through the 3 levels of the hierarchy and process all the combinations via nested for-loops. Think of it as examining a data cube. Sometimes people can get carried away with nested for-loops. Below is a picture to emphasize this.
The nested for-loop portion of the code is below.
# Assumes the lookup dictionaries come from input_data.py (see above).
import common_processing
from input_data import categories_lvl_1, categories_lvl_1_2, categories_lvl_2_3

for category_lvl_1 in categories_lvl_1:
    try:
        # Look up the level 2 categories under this level 1 category.
        categories_lvl_2 = categories_lvl_1_2[category_lvl_1]
        for category_lvl_2 in categories_lvl_2:
            common_processing.process_categories_non_lvl_1(
                2, category_lvl_1, category_lvl_2
            )
            try:
                # Look up the level 3 categories under this level 2 category.
                categories_lvl_3 = categories_lvl_2_3[category_lvl_2]
                for category_lvl_3 in categories_lvl_3:
                    common_processing.process_categories_non_lvl_1(
                        3, category_lvl_2, category_lvl_3
                    )
            except KeyError:
                # This level 2 category has no level 3 categories.
                ...
    except KeyError:
        # This level 1 category has no level 2 categories.
        ...
Notice that the processing of a level N / level (N + 1) category combination is identical at every level and is encapsulated by common_processing.process_categories_non_lvl_1(...). We will use this fact when we parallelize the processing. The processing of level 1 categories [process_categories_lvl_1()] as well as the computation of the summary counts [summary_counts()] is also included in the common_processing module. In other words, we have prepared the code for parallelization.
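The internals of common_processing are not the focus of this post. For orientation, its interface can be sketched roughly as follows; the parameter names are my own guesses, but the signatures mirror the calls shown in this post.

# common_processing.py (interface sketch) -- the real implementations contain
# the text-to-category mapping logic and are deliberately not shown here.

def process_categories_lvl_1():
    # Process every level 1 category.
    ...


def process_categories_non_lvl_1(level, parent_category, category):
    # Process the (parent category, category) combination at the given level.
    ...


def summary_counts():
    # Compute the summary counts after all category processing has completed.
    ...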
There are several issues with using nested for-loops. Their use implies an ordering even when one may not exist. In our case there is no such ordering, because all level 2 categories can be processed in parallel.
Another issue is that nested for-loops are often a source of performance problems; in our case the complexity is O(N³).
Also, notice that the above code has try/except blocks because not all of the generated combinations actually exist. It would be better not to generate combinations that don't exist and, in doing so, avoid the try/except blocks.
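One way to avoid generating the non-existent combinations (and therefore the try/except blocks) is to look up each level with dict.get() and an empty default, so that missing keys simply yield nothing to iterate over. A minimal sketch:

for category_lvl_1 in categories_lvl_1:
    # A missing key now yields an empty list instead of a KeyError.
    for category_lvl_2 in categories_lvl_1_2.get(category_lvl_1, []):
        common_processing.process_categories_non_lvl_1(2, category_lvl_1, category_lvl_2)
        for category_lvl_3 in categories_lvl_2_3.get(category_lvl_2, []):
            common_processing.process_categories_non_lvl_1(3, category_lvl_2, category_lvl_3)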
Concurrently Process All Level 2 Categories
For those who prefer to go straight to the code for concurrently processing all level 2 categories, click here. For those who prefer a description of the how and why of the code, please continue reading.
The nature of the problem is such that it allows for the parallel processing of all level 2 categories. The ideal translation of this into code is as follows:
with Pool() as pool:
    pool.imap_unordered(process_category_lvl_2, get_level_2_categories())
    pool.close()
    pool.join()
All that is needed is to code up process_category_lvl_2() and get_level_2_categories(). The implementation of process_category_lvl_2() involves the following steps:
Obtain the category level 1 associated with the input level 2 category.
Process the level 1, 2 category combination.
Process the level 2, 3 combinations for the input level 2 category.
The implementation of get_level_2_categories() is straightforward. In most cases it will involve a simple query to a database. A sketch of both functions is shown below.
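Putting those steps together, a minimal sketch of the two functions might look like the following. The lookup dictionaries are the ones from input_data.py, and category_lvl_2_to_1 is a hypothetical reverse lookup (level 2 category to its level 1 parent) whose construction is sketched in the next section.

import common_processing
from input_data import categories_lvl_1_2, categories_lvl_2_3


def get_level_2_categories():
    # In production this would likely be a simple database query; with the
    # in-memory lookups it is just every level 2 category in the hierarchy.
    for lvl_2_list in categories_lvl_1_2.values():
        yield from lvl_2_list


def process_category_lvl_2(category_lvl_2):
    # Step 1: obtain the level 1 category associated with this level 2 category.
    category_lvl_1 = category_lvl_2_to_1[category_lvl_2]

    # Step 2: process the level 1 / level 2 combination.
    common_processing.process_categories_non_lvl_1(2, category_lvl_1, category_lvl_2)

    # Step 3: process every level 2 / level 3 combination for this level 2 category.
    for category_lvl_3 in categories_lvl_2_3.get(category_lvl_2, []):
        common_processing.process_categories_non_lvl_1(3, category_lvl_2, category_lvl_3)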
Consequently, the main function is simply
# Level 1 categories must be processed before any level 2 work starts.
common_processing.process_categories_lvl_1()

# Process all level 2 categories (and their level 3 children) concurrently.
with Pool() as pool:
    pool.imap_unordered(process_category_lvl_2, get_level_2_categories())
    pool.close()
    pool.join()

# Summary counts are computed only after all category processing has finished.
common_processing.summary_counts()
Notice how much simpler the above code is compared to the nested for-loops described previously.
The multiprocessing.pool.Pool() is being used as a throttling mechanism. We don't want to flood the system with a large number of concurrent processes each fighting for resources.
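By default Pool() creates one worker process per CPU. If even that is too aggressive for a given machine, the processes argument caps the number of concurrent workers; as a small illustration (the specific count below is arbitrary):

import os
from multiprocessing import Pool

# Leave roughly one core free for the main process and the rest of the system.
worker_count = max(1, (os.cpu_count() or 2) - 1)

with Pool(processes=worker_count) as pool:
    pool.imap_unordered(process_category_lvl_2, get_level_2_categories())
    pool.close()
    pool.join()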
If you want to impress people with big words, you can say that you used the "Single Instruction, Multiple Data (SIMD)" design pattern. Another set of big words that you can use is "embarrassingly parallel".
Nested For-Loops Were a Lookup Problem in Disguise
Notice how the nested for-loop solution was actually a lookup problem in disguise.
There is an initial lookup to get the set of level 2 categories.
There is another lookup to get the level 1 category associated with a level 2 category (one way to build this reverse lookup is sketched after this list).
There is a final lookup to get all the level 3 categories associated with a level 2 category.
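The second of these lookups needs a level 2 to level 1 mapping, which the forward dictionary categories_lvl_1_2 does not provide directly. One possible sketch, assuming each level 2 category has exactly one level 1 parent, is to invert that dictionary once up front (category_lvl_2_to_1 is the hypothetical name used in the earlier sketch):

# Invert the level 1 -> level 2 lookup into a level 2 -> level 1 lookup.
# Assumes each level 2 category has exactly one level 1 parent.
category_lvl_2_to_1 = {
    category_lvl_2: category_lvl_1
    for category_lvl_1, lvl_2_list in categories_lvl_1_2.items()
    for category_lvl_2 in lvl_2_list
}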
Further Parallelization
Let's try to take advantage of the fact that all level 3 categories associated with a completed level 2 category can be processed in parallel.
One approach is to pass the pool to the concurrent processes being executed. Unfortunately, this won't work because "the methods of a pool should only ever be used by the process which created it."
Another approach is to create a Pool() in the concurrent processes being executed. Unfortunately, this won't work because "cannot create a Pool in pooled function."
Unfortunately, this means that we would have to re-architect the approach. Luckily for me, just concurrently executing the level 2 category processing is good enough at this point in time.
However, for those who have use cases that require further parallelization, one option is to use Managers. They provide a workaround for the inability to directly share a process pool. For the details, refer to the article "Share a Multiprocessing Pool With Workers" by Jason Brownlee.
The danger of using a Manager is that you might end up creating your own framework for concurrent execution. Perhaps it would be best to just use one of the existing frameworks? Below is a partial list of these frameworks to help you get started in your exploration.
Also, you might get lucky and meet your performance goals by using Numba ("just" add the @jit decorator). For additional details, refer to the article "Explicit Parallel Loops" at Numba.ReadTheDocs.Io.
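As a rough sketch of what that looks like, here is a toy numeric kernel using @njit (the nopython form of @jit) with parallel=True and prange, per the documentation linked above. The function itself is a stand-in and has nothing to do with the category processing in this post.

import numpy as np
from numba import njit, prange


@njit(parallel=True)
def row_sums(matrix):
    # prange tells Numba the iterations are independent and may run in parallel.
    result = np.zeros(matrix.shape[0])
    for i in prange(matrix.shape[0]):
        result[i] = matrix[i, :].sum()
    return result


print(row_sums(np.random.rand(1000, 1000)))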
Don't Use Python
Ultimately, if you want blazing performance, you will have to convert from Python to some other language. One alternative is Cython. Another alternative is Julia (Parallel Map and Loops). There are many others. If this type of approach is used, please consider whether the overhead of adding another language to your infrastructure is worth it.
Miscellaneous Technical Details
Since all the category processing has to complete before the summary counts are computed, we have to issue a pool close() followed by a pool join(). For the details of why this is necessary, refer to the article "7 Multiprocessing Pool Common Errors in Python" by Jason Brownlee. Unfortunately, the pool context manager by itself can't accomplish this goal, because on exit it terminates the pool rather than waiting for outstanding work to finish. For the details, refer to the article "Multiprocessing Pool Context Manager" by Jason Brownlee.
The other thing is that to guarantee that print statements actually show up, the associated buffer must be flushed. For the details, refer to the following articles.
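As a quick illustration, there are two common ways to do the flushing (the message below is just an example):

import sys

# Option 1: flush as part of the print call itself.
print("processed a level 2 category", flush=True)

# Option 2: flush the underlying stream explicitly.
print("processed a level 2 category")
sys.stdout.flush()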
To Dos
Unfortunately, there is an infinite number of things to do and a limited amount of time. Consequently, I have left out a few things which I hope to get to at a later point.
The first is a set of tests to ensure that the parallel solution yields the same results as the naive nested-loop solution. Currently, I manually compare the files nested_for_loops_output.txt and parallelize_level_2_categories_output.txt to verify that exactly the same category pair processing is happening. This cannot easily be automated because the order of the category pair processing differs between the two files.
The second is benchmarking/profiling. In my particular case, I will fix the number of lines of text to categorize and then vary the number of physical cores and the number of processes created by Pool(). For guidance on setting the number of processes created by Pool(), refer to the following articles.
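As a rough sketch of the kind of harness I have in mind (run_all is a hypothetical wrapper around the main function shown earlier, with a configurable pool size):

import time
from multiprocessing import Pool

import common_processing


def run_all(process_count):
    # Same flow as the main function above, but with a configurable pool size.
    common_processing.process_categories_lvl_1()
    with Pool(processes=process_count) as pool:
        pool.imap_unordered(process_category_lvl_2, get_level_2_categories())
        pool.close()
        pool.join()
    common_processing.summary_counts()


for process_count in (1, 2, 4, 8):
    start = time.perf_counter()
    run_all(process_count)
    elapsed = time.perf_counter() - start
    print(f"{process_count} processes: {elapsed:.2f} seconds", flush=True)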
Summary
The selected use case is slightly complex in order to better reflect a real-world problem. It also enables a discussion of the tradeoffs that always need to be made.
One thing that has not been mentioned so far is that it is a must to write down the definition of done before doing anything. In my particular use case, the definition of done is to map x number of lines of text to categories in one hour or less. The mapping is done by process_categories_lvl_1() and process_categories_non_lvl_1() in the common_processing module.
The other thing that needs to be written down is the constraints. In my particular use case, we don't want to modify the code in the common_processing module because it is highly complex. Consequently, the easiest way to reduce processing time is to execute all of the level 2 category processing in parallel.
Lastly, it should be noted that there is no general solution to parallelizing nested for-loops, and it is not always possible to parallelize them at all. Each use case is different. In the use case addressed in this blog, we took advantage of the fact that we could process all level 2 categories in parallel. For some general guidelines on parallelizing nested for-loops, refer to the article "Parallel Nested For-Loops in Python" by Jason Brownlee.
References
The purpose of the reference section is to make it easy for the reader to find things. The key passages within each link are also provided, so you don't have to search an entire article to find the applicable material.
The reference section is lengthy and so it has been placed on a separate web page. To go to it, click here.