Friday, August 28, 2020

Python Iterable: Sequences, Non-Sequence Iterables and So On

The iterable is a key concept in Python. Think of it as anything you can run a for loop over. This includes non-sequences such as dictionaries and sets.
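As a small illustration (a sketch of my own, not part of the table), the same for loop works over a sequence and over non-sequence iterables. The names samples and describe are made up for this example; the abstract base classes come from the standard collections.abc module.

```python
from collections.abc import Iterable, Sequence

# Hypothetical sample objects: one sequence and two non-sequence iterables.
samples = {
    "list": [1, 2, 3],
    "dict": {"a": 1, "b": 2},
    "set": {10, 20},
}


def describe(obj):
    """Return (is_iterable, is_sequence) for obj."""
    return isinstance(obj, Iterable), isinstance(obj, Sequence)


for name, obj in samples.items():
    items = [item for item in obj]  # iterating a dict yields its keys
    print(name, describe(obj), sorted(items))
```

All three objects can be looped over, but only the list also counts as a sequence.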

Trying to determine which of Python's various data types support the iterable concept can be tedious. Below is a table that helps with that. Also, the links in the table go to the corresponding Python documentation.

Please note that the table is not meant to be an exhaustive enumeration of where an iterable is used. The goal is to help people get started on their journey. For example, the table contains collections.abc.Sequence but it does not contain the other applicable abstract base classes for containers.

Technical Note: I couldn't get Blogger.Com to properly render the table. Consequently, I have included a screenshot of the table below. To go to the table with the links, click here.





Wednesday, February 12, 2020

How do you handle the situation when your data does not fit into memory?


This won't be a conventional blog post. It won't have paragraph after paragraph. Instead, it will be a list around which people can have conversations. The links within the list and the references at the end will provide the details.


The majority of the list items below come from Itamar Turner-Trauring's video mentioned earlier.


1. Small Big Data: too big to fit in memory but too small for a complex big data cluster
    a. Load only what you need
        i. Do you really need to load all the columns?
    b. Use the appropriate data type
        i. If possible, use NumPy 16-bit unsigned integers (uint16) rather than the 64-bit version (uint64).
        ii. If possible, use the Pandas category dtype rather than the object dtype.
    c. Divide data into chunks and process one chunk at a time
        i. pandas.read_csv(..., chunksize=None, ...): set chunksize to a number of rows to get an iterator of DataFrames instead of one large DataFrame
        ii. Zarr: chunked, compressed, N-dimensional arrays
    d. Use generator expressions: lazily generated iterable objects
        i. Python's Generator Expressions: Fitting Large Datasets into Memory by Luciano Strika
            A. Commonly used example (a file object is itself a lazy iterable that yields one line at a time)

                with open('big.csv') as f:
                    for line in f:
                        process(line)  # process() stands for whatever per-line work you need

    e. Indexing: useful when you only need a specific subset of the data
        i. Populate SQLite from Pandas
            A. Load from SQLite into a DataFrame
        ii. vroom R package
    f. If your data is sparse, take advantage of it.
        i. Pandas - Sparse Data Structures
        ii. sparse.pydata.org
    g. Approaches involving approximations
        i. Requires application of domain knowledge
        ii. Just sample your data.
        iii. Probabilistic data structures
    h. Streaming
        i. Streamz
    i. Use a single-machine distributed framework
        i. Single Machine: dask.distributed
    j. Just load your data into a database like PostgreSQL.
    k. Just use the disk.
        i. It will be slow but perhaps acceptable until you can implement a better solution.
    l. Money solution
        i. Buy or rent computers with large amounts of RAM
2. Big Data: anything that is not "small big data"
    a. Dask, Spark, ...
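To make items 1.a through 1.c a bit more concrete, here is a minimal sketch that combines them in a single pandas.read_csv call. The column names and the in-memory CSV text are invented for the example; with real data you would pass a file path instead of the StringIO object.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data standing in for a large CSV file on disk.
csv_text = "user_id,country,clicks,notes\n" + "".join(
    f"{i},{'US' if i % 2 else 'DE'},{i % 100},free text {i}\n" for i in range(1000)
)

total_clicks = 0
rows_seen = 0

# usecols: load only the columns you need (1.a).
# dtype: use a small unsigned integer and a category dtype (1.b).
# chunksize: get an iterator of DataFrames with at most 250 rows each (1.c).
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["country", "clicks"],
    dtype={"country": "category", "clicks": np.uint16},
    chunksize=250,
)

for chunk in reader:
    # Only one chunk is held in memory at a time.
    total_clicks += int(chunk["clicks"].sum())
    rows_seen += len(chunk)

print(rows_seen, total_clicks)
```

Whether uint16 is safe depends on your data: it only holds values from 0 to 65535, so check the range of each column before shrinking its type.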

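Items 1.d and 1.e can be sketched with the standard library alone. The example below is only an illustration under assumptions of my own: the table name squares and the generated rows are made up, and the database lives in memory purely so the snippet is self-contained. For data that does not fit in memory you would pass a file path to sqlite3.connect.

```python
import io
import sqlite3

# Hypothetical stand-in for a large file; a real one would come from open(...).
lines = io.StringIO("".join(f"{i},{i * i}\n" for i in range(10_000)))

# 1.d: a generator expression yields one parsed row at a time,
# so the full list of parsed rows never exists in memory.
rows = ((int(a), int(b)) for a, b in (line.split(",") for line in lines))

# 1.e: store the rows in SQLite with an index (here the primary key),
# then read back only the subset that is needed.
conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute("CREATE TABLE squares (n INTEGER PRIMARY KEY, sq INTEGER)")
conn.executemany("INSERT INTO squares VALUES (?, ?)", rows)  # consumes rows lazily
conn.commit()

subset = conn.execute(
    "SELECT n, sq FROM squares WHERE n BETWEEN ? AND ? ORDER BY n", (100, 102)
).fetchall()
print(subset)
conn.close()
```

The query touches only the three requested rows through the index rather than scanning everything that was loaded.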

References


1. NumPy Reference
    a. Array Objects
        i. Data Type Objects (dtype)
2. Pandas
    a. Getting Started
        i. Essential Basic Functionality
            A. dtypes
3. Python Documentation
    a. Language Reference
        i. Expressions
            A. Atoms
                I. Generator Expressions