Friday, August 28, 2020

Python Iterable: Sequences, Non-Sequence Iterables and So On

Iterable is a key concept in Python. Think of it as allowing you to do a for loop over "stuff." This includes being able to do for loops over non-sequences like dictionaries and sets.
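As a minimal sketch of the idea, the snippet below runs the same loop over a list (a sequence) and over a dictionary and a set (non-sequence iterables). The variable names are made up for illustration.

```python
from collections.abc import Iterable, Sequence

colors_list = ["red", "green", "blue"]   # a sequence
colors_dict = {"red": 1, "green": 2}     # not a sequence; iterating yields the keys
colors_set = {"red"}                     # not a sequence

for name, obj in [("list", colors_list), ("dict", colors_dict), ("set", colors_set)]:
    items = [item for item in obj]       # looping works on all three
    print(name, items, isinstance(obj, Iterable), isinstance(obj, Sequence))
```

All three objects report as Iterable, but only the list reports as a Sequence.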

Trying to determine which of Python's various data types support the iterable concept can be tedious. Below is a table that helps with that. Also, the links in the table go to the corresponding Python documentation.

Please note that the table is not meant to be an exhaustive enumeration of where an iterable is used. The goal is to help people get started on their journey. For example, the table contains collections.abc.Sequence but it does not contain the other applicable abstract base classes for containers.

Technical Note: I couldn't get Blogger.Com to properly render the table. Consequently, I have included a screenshot of the table below. To go to the table with the links, click here.





Wednesday, February 12, 2020

How do you handle the situation when your data does not fit into memory?


This won't be a conventional blog post. It won't have paragraph after paragraph. Instead, it will be a list around which people can have conversations. The links within the list and the references at the end will provide the details.


The majority of the list items below come from Itamar Turner-Trauring's video mentioned earlier.


1. Small Big Data: too big to fit in memory but too small for a complex big data cluster
    a. Load only what you need
        i. Do you really need to load all the columns?
    b. Use the appropriate data type
        i. If possible, use NumPy 16-bit unsigned integers (uint16) rather than 64-bit integers.
        ii. If possible, use the Pandas category data type rather than the object data type.
    c. Divide data into chunks and process one chunk at a time
        i. pandas.read_csv(..., chunksize=N, ...): the default, chunksize=None, reads the whole file at once; passing an integer N returns an iterator that yields N rows at a time
        ii. Zarr: chunked, compressed, N-dimensional arrays
    d. Use generator expressions: lazily generated iterable objects
        i. Python’s Generator Expressions: Fitting Large Datasets into Memory by Luciano Strika
            A. Commonly used example

with open('big.csv') as f:
    for line in f:
        process(line)

    e. Indexing: useful when you only need a specific subset of the data
        i. populate SQLite from Pandas
            A. load from SQLite into DataFrame
        ii. vroom R package
    f. If your data is sparse, take advantage of it.
        i. Pandas - Sparse Data Structures
        ii. sparse.pydata.org
    g. Approaches Involving Approximations
        i. Requires application of domain knowledge
        ii. Just sample your data.
        iii. Probabilistic Data Structures
    h. Streaming
        i. Streamz
    i. Use a single-machine distributed framework
        i. Single Machine: dask.distributed
    j. Just load your data into a database like PostgreSQL.
    k. Just use the disk.
        i. It will be slow but perhaps acceptable until you can implement a better solution.
    l. Money Solution
        i. Buy or rent computers with large amounts of RAM
2. Big Data: anything that is not "small big data"
    a. Dask, Spark, ...
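To make items 1.c and 1.d a bit more concrete, here is a minimal sketch that uses only the standard library. The helper read_in_chunks, the column name amount, and the in-memory sample data are all hypothetical stand-ins; with pandas, passing an integer chunksize to pandas.read_csv plays a similar role.

```python
import csv
import io
from itertools import islice


def read_in_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable of rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for a large CSV file on disk (hypothetical data).
sample = io.StringIO("amount\n10\n20\n30\n40\n50\n")
reader = csv.DictReader(sample)

total = 0
for chunk in read_in_chunks(reader, chunk_size=2):
    # Generator expression: each value is converted and summed one at a time,
    # without building an intermediate list.
    total += sum(int(row["amount"]) for row in chunk)

print(total)
```

Only one chunk of rows is held in memory at a time, so the same loop would work on a file far larger than RAM.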


References


1. NumPy Reference
    a. Array Objects
        i. Data Type Objects (dtype)
2. Pandas
    a. Getting Started
        i. Essential Basic Functionality
            A. dtypes
3. Python Documentation
    a. Language Reference
        i. Expressions
            A. Atoms
                I. Generator Expressions

Thursday, September 5, 2019

Quick Explanation of the Python Generator


The goal of this blog post is to provide a quick explanation of what a generator is and a short code snippet demonstrating it.

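Generators are functions that can be used as iterators: calling a generator function returns a generator iterator that you can loop over. Another way of saying the same thing is that a generator is a function which may hand back a value (via yield) as many times as needed, instead of returning just once.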
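Here is a small sketch of a function that hands back several values, one per request. The function name countdown is made up for illustration.

```python
def countdown(n):
    """Yield n, n-1, ..., 1, pausing after each yield."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)   # calling the function returns a generator iterator; the body has not run yet
first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes right after the previous yield
rest = list(gen)     # consumes whatever is left

print(first, second, rest)
```

The local variable n keeps its value between calls to next(), which is what lets the function pick up where it left off.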

A good way to demonstrate a generator is via the Fibonacci Sequence.

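def fib():
    # Yield Fibonacci numbers forever; the caller decides when to stop.
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Print the Fibonacci numbers that do not exceed 100.
for f in fib():
    if f > 100:
        break
    print(f)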

The pro of generators is that they require less memory, because values are generated only as you need them. The con is that you may increase the workload on your CPU and make your code more complex.
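The memory point can be checked roughly with sys.getsizeof, which reports the size of the container object itself (not the integers it refers to). This is only a sketch; exact byte counts vary by Python version, so none are shown here.

```python
import sys

n = 100_000

squares_list = [i * i for i in range(n)]   # all values exist in memory at once
squares_gen = (i * i for i in range(n))    # values are produced one at a time

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size > gen_size)

# Both give the same total; the generator never builds the full list.
print(sum(squares_gen) == sum(squares_list))
```

Note that the generator is exhausted after the call to sum(), whereas the list can be traversed again.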

If you are interested in learning the details, please refer to the following items:

1. Python Documentation
    a. Glossary
        i. Asynchronous Generator
        ii. Asynchronous Generator Iterator
        iii. Asynchronous Iterable
        iv. Asynchronous Iterator
        v. Generator
        vi. Generator Expression
        vii. Generator Iterator
        viii. Iterable
        ix. Iterator
        x. Sequence
    b. PEPs
        i. PEP 255 - Simple Generators
        ii. PEP 342 - Coroutines via Enhanced Generators
        iii. PEP 525 - Asynchronous Generators
    c. Python Language Reference
        i. Compound Statements
        ii. Data Model
            A. Special Method Names
                I. Emulating Container Types
        iii. Expressions
            A. Atoms
                I. Generator Expressions
                II. Yield Expressions
                    a. generator.__next__()
                    b. generator.send(...)
                    c. generator.throw(...)
                    d. generator.close()
                    e. Examples
        iv. Simple Statements
    d. Python Standard Library
        i. Built-In Exceptions
            A. Concrete Exceptions
                I. exception IndexError
        ii. Built-In Functions
            A. next(...)
        iii. Built-In Types
            I. container.__iter__()
            II. Generator Types
            III. iterator.__iter__()
            IV. iterator.__next__()