Friday, January 18, 2019

Roadmap to Developing an Understanding of Python Concurrency

We will provide a suggested roadmap for developing an understanding of concurrency in general and its implementation in Python.

The following questions will be addressed:
1. What is concurrency?
2. What are the different flavors of concurrency?
3. How does Python implement concurrency?

The roadmap will provide brief descriptions of key concepts and how they fit together. For those who want to delve deeper into any concept, links will be provided.

        Motivation for Blog Creation


        I was "volunteered" to give a presentation on Python concurrency at the Pittsburgh Python users group.

One of the questions that comes up for people is whether or not to attend the presentation. It is hard to decide this from a brief description. So, my solution is to create a blog post for the presentation ahead of time.

        Another question that comes up is how to manage the quantity of material presented in a short period of time. This will be especially true in the case of Python concurrency. So, having the blog post ahead of time will allow people to focus on their interactions with others and spend less time taking notes.

        Also, some will take advantage of the blog post to prepare questions and comments for the presentation.

        Definitions


Dictionary.com defines concurrence as "simultaneous occurrence". The phrase "from a human perspective" needs to be added. So, our definition of concurrence is "simultaneous occurrence from a human perspective."

Compared to computers, humans are really slow. I am going to pick a 100 ms response time as being instantaneous from a human perspective. If you want to pick another value, refer to How Fast is Realtime? Human Perception and Technology by PubNub Staff. Computer speed can be measured in millions of instructions per second (MIPS). So, even if you had just 1 MIPS available, you could execute 100,000 instructions before a human would notice. Please note that an old Pentium-based PC was rated at 100 MIPS.

        This human slowness allows us to play tricks when implementing software. Some of the tricks center around the ability of the software to switch between "stuff" without the human ever noticing. The "stuff" goes by names like threads and tasks. These names will be explicitly defined later in the blog.

        Number of Cores: Many - Multiprocessing


        One flavor of concurrency is multiprocessing. Let's begin to create a definition for multiprocessing with a picture.



Here we have individuals juggling independently of each other. There is no interaction between them. Likewise, a process is a unit of execution that shares no resources with other processes. The operating system (OS) assigns a process resources and that is that. Depending on the computer load, the OS might suspend the process and assign those resources to another process. Note that there is no interaction between the suspended process and the new process. If more information is desired regarding multiprocessing, refer to the Wikipedia multiprocessing article.

        Working Only with Python Code - Use Python's Multiprocessing Module


        If more information is desired regarding Python's implementation of multiprocessing when only Python code is involved, refer to its multiprocessing module. It provides queues, events, worker pools, and other marshaling paradigms.
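To make this concrete, here is a minimal sketch of the multiprocessing module's Pool; the worker function and inputs are my own illustration, not code from the module's documentation:

```python
import multiprocessing

def square(n):
    """Worker function; runs in a separate process with its own memory."""
    return n * n

if __name__ == "__main__":
    # The pool hands each input to one of 4 worker processes and
    # collects the results in input order.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(square, range(5))
    print(results)  # [0, 1, 4, 9, 16]
```

The `if __name__ == "__main__":` guard matters here: on platforms that spawn rather than fork, each worker re-imports the module, and the guard keeps the pool from being created recursively.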

        Working with Non-Python Code - Use Python's Subprocess Module


Use cases exist where we have to concurrently invoke non-Python code. In such circumstances, the subprocess module should be used. Unfortunately, it does not provide queues, events, worker pools, or other marshaling paradigms. This is to be expected because each external interface can be different.
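As a minimal sketch, the POSIX echo command below stands in for whatever non-Python program you actually need to invoke (the command is my placeholder):

```python
import subprocess

# Run an external program and capture its output.  stdout=PIPE works on
# Python 3.6; on 3.7+ you could use capture_output=True instead.
result = subprocess.run(
    ["echo", "hello"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
print(result.returncode)       # 0 on success
print(result.stdout.decode())  # the program's standard output
```

From here, the multiprocessing-style coordination (queues, pools, and so on) is yours to build around the subprocess calls.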

        Number of Cores: 1 - Multitasking


        If you aren't using the 2 previously mentioned modules, then execution can occur in only 1 core. The reason for this is Python's Global Interpreter Lock (GIL). For further details, refer to the blog What is the Python Global Interpreter Lock (GIL)? by Abhinav Ajitsaria.

        This other flavor of concurrency is called multitasking. In turn, multitasking has 2 sub-flavors: pre-emptive multitasking and cooperative multitasking.

        Pre-Emptive Multitasking (Threading)


        As before, we will begin to create a definition with a picture.



If you notice, there is one individual, and he can, at will, pick up a ball and place it in another location. This is done without examining the state of the ball. Pre-emptive multitasking is the same thing, except that the individual is replaced by the OS and the balls are replaced by threads.

Below are the standard threading questions:
1. Is the code thread safe? For further details, refer to "Wikipedia: Thread safety."
2. Another way of asking the above question, but using different words: do you have any race conditions? For further details, refer to "Wikipedia: Race condition."
3. How many threads should be created? Experimentation is usually required to answer this question.

Please note that the above is not "just 3 questions." Each question by itself involves a lot of complexity.
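To illustrate the thread safety question, here is a minimal sketch of my own (not from a referenced article) in which 4 threads increment a shared counter. The lock is what makes the read-modify-write sequence safe; remove it, and the threads can interleave and lose updates:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # counter += 1 is really read, add, write.  Without the lock,
        # two threads can interleave those steps: a race condition.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 with the lock; often less without it
```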

              Cooperative Multitasking (Asyncio)


              For the third time, we will begin to create a definition with a picture.



If you notice, the 3 individuals have to cooperate very closely in order to successfully juggle. This cooperation is the key to cooperative multitasking. Regrettably, that is as far as the analogy will take us this time.

In the case of asyncio, the entities doing work are called tasks. Tasks cannot be pre-empted; they give up control only voluntarily. Because of this, they do not have to be made thread-safe, which simplifies them. Also, each task can be in a "ready to run" or "waiting" state. The coordination of the tasks is accomplished via the simple concept of an Event Loop. This again reduces the complexity.

The advantage of cooperative multitasking is that you don't have to worry about thread safety. The con is the added complexity of writing the software to make sure that everything cooperates.
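As a minimal sketch of cooperative multitasking (my own example, written in the Python 3.6 style used throughout this post), three tasks voluntarily hand control back to the event loop at each await:

```python
import asyncio

async def worker(name, delay):
    # await is the explicit point where this task yields control to
    # the event loop, letting the other tasks run while it waits.
    await asyncio.sleep(delay)
    return name

async def main():
    # On Python 3.7+ you could use asyncio.create_task() instead.
    tasks = [asyncio.ensure_future(worker(n, 0.01)) for n in ("a", "b", "c")]
    return await asyncio.gather(*tasks)

loop = asyncio.new_event_loop()
results = loop.run_until_complete(main())
loop.close()
print(results)  # ['a', 'b', 'c']
```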

              Use Cases


              Now that we have discussed the concepts and tools, let's determine when they should be used.

              The blog post Speed Up Your Python Program With Concurrency by Jim Anderson contains a really nice set of code snippets that demonstrate the tools. Its code snippets will be extensively used below.

              All code was executed using Python 3.6.7.

              Use Case 1: I/O Bound


              The I/O bound condition is created by the download of internet web pages. In this situation, the CPU spends a large portion of its time waiting for data to arrive over the internet. 


              Non-Concurrent Code


              To go to the non-concurrent code from Jim Anderson's blog, click here. The code requires that the requests module is installed.

              Pro: Code simplicity

              Con: Slow compared to the other approaches.

              Use Threading


              To go to the code that uses threading from Jim Anderson's blog, click here.

              Notice the change in the download_all_sites function between the non-concurrent code and the code that uses threading. The corresponding code snippets are below.

              Non-Concurrent Code
              
              
              def download_all_sites(sites):
                  with requests.Session() as session:
                      for url in sites:
                          download_site(url, session)
              
              
              Threading: Code Snippet 1
              
              
              def download_all_sites(sites):
                  with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
                      executor.map(download_site, sites)
              
              
Let's examine ThreadPoolExecutor. As the name implies, it gives you a pool of threads and an executor. In this particular case, you get a pool of 5 threads. The executor is the component that does the coordination.
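As a self-contained sketch of the same pattern (the fetch function below is a stand-in of mine, not the blog's download_site):

```python
import concurrent.futures

def fetch(url):
    # Stand-in for download_site: no real network call, just a label.
    return "done: " + url

urls = ["site-1", "site-2", "site-3"]
# Up to 5 threads run fetch concurrently; map yields results in input order.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))
print(results)  # ['done: site-1', 'done: site-2', 'done: site-3']
```

Note that executor.map preserves input order even though the calls may finish in a different order.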

Thread safety is addressed by using Thread-Local Data in the get_session function. The associated code snippet is below. Notice that the thread_local object is created once, not once for each thread.

              Threading: Code Snippet 2
thread_local = threading.local()
              
              def get_session():
                  if not getattr(thread_local, "session", None):
                      thread_local.session = requests.Session()
                  return thread_local.session
              
              
              Please note that requests.Session() is not thread-safe.

              Pro: Faster than non-concurrent code

On my laptop, the code that used threading was about 0.2 seconds faster than the non-concurrent code. This may not seem like much; but at scale, it makes a difference. Note that elapsed time is a good way to start assessing performance. If more precise measurements are needed, consider using timeit with process_time().
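For example, here is a minimal sketch contrasting wall-clock time with CPU time; the sleep stands in for waiting on the network:

```python
import time

start_wall = time.perf_counter()  # wall-clock time, includes time spent waiting
start_cpu = time.process_time()   # CPU time only; sleeping and I/O don't count

time.sleep(0.1)  # stands in for an I/O wait, e.g. a download

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print("wall: %.3fs, cpu: %.3fs" % (wall, cpu))  # cpu stays near zero
```

For I/O-bound code, the gap between the two numbers is exactly the waiting that concurrency can hide.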

              Con: Dramatic increase in complexity

We now have to deal with:
1. Executors
2. Pools
3. Thread safety
4. Threads

                    Use Asyncio


                    To go to the code that uses asyncio from Jim Anderson's blog, click here. The code requires that the aiohttp module is installed.

                    Once again, let's focus on the download_all_sites function.

                    Asyncio Code
                    
                    
                    async def download_all_sites(sites):
                        async with aiohttp.ClientSession() as session:
                            tasks = []
                            for url in sites:
                                task = asyncio.ensure_future(download_site(session, url))
                                tasks.append(task)
                            await asyncio.gather(*tasks, return_exceptions=True)
                    
Let's start by noticing that the keyword async prefixes the function definition. This turns the function into a coroutine. Coroutines are code blocks that can yield execution to other code blocks.

                    Next, notice that Session is created only once and shared among all the tasks. This is possible because a task can't be interrupted and leave the Session in a bad state. So, the tasks can share the Session.

                    The tasks are created and started using asyncio.ensure_future(). If you are using Python 3.7 or later, you can use asyncio.create_task() instead. To ensure that all the tasks are completed before the Session is terminated, asyncio.gather() is used.

                    The await keyword is how the task hands control back to the event loop.

                    Now that we have the tasks, we need to start the event loop and give it the tasks. This is accomplished with the following in __main__
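The snippet itself appears to be missing here. On Python 3.6 it boils down to obtaining the event loop and calling run_until_complete on the top-level coroutine; below is a runnable sketch in which a stub coroutine of mine stands in for the article's download_all_sites:

```python
import asyncio

async def download_all_sites(sites):
    # Stub standing in for the article's coroutine shown above.
    await asyncio.sleep(0)
    return len(sites)

sites = ["http://example.com"] * 3
# Python 3.6 style: fetch the event loop and block until the top-level
# coroutine (and every task it gathers) has finished.
loop = asyncio.get_event_loop()
count = loop.run_until_complete(download_all_sites(sites))
print(count)  # 3
```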


                    If you are using Python 3.7 or higher, you can use asyncio.run() instead.

                    Pro: Noticeably faster than using threads

On my laptop, the code that used asyncio was about 5 seconds faster than the code using threads. Notice that going from non-concurrent code to threading only improved performance by about 0.2 seconds, while going from threading to asyncio improved performance by about 5 seconds. That is more than an order of magnitude bigger improvement!

                    Con: More complex syntax

                    The code becomes littered with "async." You need to create and start tasks. You then need to make sure that all the tasks complete before the session is terminated. Last but not least, you have to take care of the event loop.

                      Use asyncio when you can - Threading when you must


The following is a direct quote from Jim Anderson's article: "For I/O-bound problems, there's a general rule of thumb in the Python community: 'Use asyncio when you can, threading when you must.'"

This is such a critical rule of thumb that it deserves its own section.

                        Use Multiprocessing


                        To go to the code that uses multiprocessing from Jim Anderson's blog, click here.

                        As in the previous cases, let's focus on the download_all_sites function.

                        Multiprocessing Code Snippet 1
                        
                        
                        def download_all_sites(sites):
                            with multiprocessing.Pool(initializer=set_global_session) as pool:
                                pool.map(download_site, sites)
                        
Let's start by noticing multiprocessing.Pool(). It creates a pool of worker processes to which jobs can be submitted. Since the processes parameter was not specified, it defaults to the number of CPUs in the system. Picking the number of worker processes is just as hard as picking the number of threads. The other thing to notice is that Pool has a parallel map implementation. So "pool.map(download_site, sites)" maps each URL to the function download_site.

                        Pro: Can take advantage of all the cores on the computer.

                        Con: Complexity of trying to align Session and processes results in the weirdness of "global session." See code snippet below.

                        Multiprocessing Code Snippet 2
                        
                        
                        session = None
                        
                        def set_global_session():
                            global session
                            if not session:
                                session = requests.Session()
                        
Since every process has its own memory, processes cannot share a Session object. At the other extreme, you don't want to create a Session each time the function is called. The goal is to create one Session per process. This is accomplished by using the initializer parameter as shown in "Multiprocessing Code Snippet 1."

                        In terms of coding complexity, multiprocessing code is on par with the code that uses threading. In terms of performance, at least on my laptop, it was slower than asyncio but faster than the other 2 approaches. Below is a table of the run times.

                        Approach        Run Time (seconds)
                        --------------  ------------------
                        No Concurrency  7.4
                        Threading       7.3
                        Multiprocessing 5.8
Asyncio         2.2
                        
                        Notice how much faster asyncio is.

                        Use Case 2: CPU Bound


We will continue to use the code snippets from Jim Anderson's blog. These code snippets create a CPU-bound condition by computing the sum of the squares of the numbers up to a passed-in value.

                        Non-Concurrent Code


                        To go to the non-concurrent code from Jim Anderson's blog, click here. Nothing much to say. We should be able to do better using multiple cores.
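For reference, the heart of the non-concurrent version is a sum-of-squares computation along these lines (a sketch in the spirit of the blog's code, not an exact copy):

```python
def cpu_bound(number):
    # Pure computation: the CPU is busy the whole time, never waiting on I/O.
    return sum(i * i for i in range(number))

def find_sums(numbers):
    for number in numbers:
        cpu_bound(number)

print(cpu_bound(10))  # 285
```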

                        Use Threading


To go to the code that uses threading from Jim Anderson's blog, click here. Again, nothing much to say. As expected, the threading approach takes longer than the non-concurrent approach.

                        Use Multiprocessing


                        To go to the code that uses multiprocessing from Jim Anderson's blog, click here.

                        Just like in the I/O bound multiprocessing case, multiprocessing.Pool() and pool.map() are used.

                        Multiprocessing Code Snippet
                        
                        
                        def find_sums(numbers):
                            with multiprocessing.Pool() as pool:
                                pool.map(cpu_bound, numbers)
                        
As expected, multiprocessing shines on CPU-bound problems. It was over 3 times faster than the non-concurrent version of the code.

Pro: Significant performance improvement with little added code complexity.

The added code complexity is small because nothing is shared.

                        Con: Splitting up the problem so that each processor can work independently can be difficult.

                        This issue did not show up in the trivial example that we did. Sometimes, problems are not embarrassingly parallel.

                        Use Case 3: Python Concurrent Interaction with Non-Python Code


To Do: Put together code that cycles through directories and invokes Java to convert DocBook XML to HTML.

                        As an aside, the above points to the fact that you can use Python instead of bash.
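As a starting point for that To Do, here is a hedged sketch; the jar name, flags, and directory layout are placeholders of mine, not a real converter:

```python
import subprocess
from pathlib import Path

def convert_all(src_dir, out_dir):
    """Invoke an external Java converter for every .xml file under src_dir.

    The jar name and arguments below are placeholders, not a real tool.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for xml_file in Path(src_dir).glob("**/*.xml"):
        html_file = out / (xml_file.stem + ".html")
        cmd = ["java", "-jar", "docbook-converter.jar",
               str(xml_file), str(html_file)]
        print("would run:", " ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment with a real converter
```

Wrapping the loop body in one of the concurrency tools above (a process pool, for example) is then a small step.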

                        Alternatives


                        There are many other options to achieve concurrency. For example, Cython supports native parallelism through the cython.parallel module. A good place to start exploring these other alternatives is "General concepts: concurrency, parallelism, threads and processes" at gevent.

                        Perhaps Concurrency Cost is Not Worth It?


                        As the preceding material indicates, concurrency is extremely complex with a lot of nuances. Which form of concurrency do I use? Is my code thread safe? Did I make sure that all my tasks completed prior to the Session terminating? And so on.

                        One option might be to just wait. The business people want to be able to run the report several times during the day but they aren't willing to allocate resources to make this happen. In that case, they will have to wait for the overnight run to complete.

Another option is just to apply brute force. Buy a bigger machine or faster disks. There is nothing wrong with brute force.



                        Let me be very blunt: DO NOT USE CONCURRENCY UNTIL YOU HAVE EXHAUSTED ALL OTHER OPTIONS!

                        Summary


                        Concurrency is a big topic. Python's implementation of concurrency is an even bigger topic. The goal of the preceding material was to provide a roadmap for learning about concurrency and Python's implementation of it at a high level. The roadmap has embedded links which provide jumping off points for further detailed learning.

                        References

1. aiohttp module
2. Articles
3. Books
   1. Concurrent Patterns and Best Practices by Atul S. Khot
   2. Distributed Computing with Python by Francesco Pierfederici
   3. Mastering Concurrency in Python by Quan Nguyen
   4. Parallel Programming with Python by Jan Palach
   5. Python High Performance Programming by Gabriele Lanaro
   6. Python Parallel Programming Cookbook by Giancarlo Zaccone
4. Dask
5. PGH Python User's Group Meetup: Python Concurrency Guided Tour
   1. Slides
6. Python Documentation
7. requests module
8. Videos