Wednesday, February 12, 2020

How do you handle the situation when your data does not fit into memory?


This won't be a conventional blog post. It won't have paragraph after paragraph. Instead, it will be a list around which people can have conversations. The links within the list and the references at the end will provide the details.


The majority of the list items below come from the Itamar Turner-Trauring video mentioned earlier.


1. Small Big Data: too big to fit in memory, but too small for a complex big data cluster
   a. Load only what you need
      i. Do you really need to load all the columns?
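
For example, a minimal pandas sketch (the file name and column names are made up):

import pandas as pd

# Read only the two columns we actually need; pandas never
# materializes the rest of the file's columns.
df = pd.read_csv('big.csv', usecols=['user_id', 'amount'])
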
   b. Use the appropriate data type
      i. If possible, use NumPy's 16-bit unsigned integers rather than the default 64-bit integers.
      ii. If possible, use the Pandas category dtype rather than the object dtype.
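
A small illustration of both ideas (the values and column are made up):

import numpy as np
import pandas as pd

# 64-bit integers by default...
counts = np.arange(50_000, dtype=np.int64)
# ...versus 16-bit unsigned integers: one quarter of the memory,
# as long as every value fits in 0..65535.
counts_small = counts.astype(np.uint16)
print(counts.nbytes, counts_small.nbytes)

# A repetitive string column stored as category instead of object.
df = pd.DataFrame({'city': ['Boston', 'Chicago'] * 500_000})
df['city'] = df['city'].astype('category')
print(df.memory_usage(deep=True))
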
   c. Divide the data into chunks and process one chunk at a time
      i. pandas.read_csv(..., chunksize=<rows per chunk>) returns an iterator of DataFrames instead of one big DataFrame
      ii. Zarr: chunked, compressed, N-dimensional arrays
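
For example (again with made-up file and column names):

import pandas as pd

# Each chunk is an ordinary DataFrame of at most 100,000 rows,
# so only one chunk is in memory at a time.
total = 0
for chunk in pd.read_csv('big.csv', chunksize=100_000):
    total += chunk['amount'].sum()
print(total)
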
   d. Use generator expressions: lazily generated iterable objects
      i. Python's Generator Expressions: Fitting Large Datasets into Memory by Luciano Strika
         A. Commonly used example:

with open('big.csv') as f:
    # Iterating over the file object yields one line at a time,
    # so the whole file is never read into memory.
    for line in f:
        process(line)
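
The same idea written as a generator expression (assuming each line holds a single number):

with open('big.csv') as f:
    # The parentheses build a generator, not a list, so only one
    # line is materialized at a time.
    total = sum(float(line) for line in f)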

   e. Indexing: when you only need a specific subset of the data
      i. Populate SQLite from Pandas
         A. Load from SQLite into a DataFrame
      ii. The vroom R package
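
A rough sketch of the SQLite round trip (the file, table, and column names are illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect('big.sqlite')

# One-time step: load the CSV into SQLite in chunks and index the lookup column.
for chunk in pd.read_csv('big.csv', chunksize=100_000):
    chunk.to_sql('events', conn, if_exists='append', index=False)
conn.execute('CREATE INDEX IF NOT EXISTS idx_user ON events (user_id)')

# Later: pull only the rows you actually need into a DataFrame.
subset = pd.read_sql('SELECT * FROM events WHERE user_id = ?', conn, params=(42,))
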
   f. If your data is sparse, take advantage of it.
      i. Pandas - Sparse Data Structures
      ii. sparse.pydata.org
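
For instance, a mostly-zero column stored with the pandas sparse dtype:

import numpy as np
import pandas as pd

dense = pd.Series(np.zeros(1_000_000))
dense.iloc[::10_000] = 1.0   # only 100 non-zero values

# Store only the non-fill values plus their positions.
sparse = dense.astype(pd.SparseDtype('float', fill_value=0.0))
print(dense.memory_usage(deep=True), sparse.memory_usage(deep=True))
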
   g. Approaches involving approximations
      i. Requires application of domain knowledge
      ii. Just sample your data.
      iii. Probabilistic data structures
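
One way to sample while reading, so the full file never has to fit in memory (the 1% rate is arbitrary):

import random
import pandas as pd

def skip_most_rows(row_number):
    # Never skip the header (row 0); each data row survives with probability ~1%.
    return row_number > 0 and random.random() > 0.01

sample = pd.read_csv('big.csv', skiprows=skip_most_rows)
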
   h. Streaming
      i. Streamz
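
A toy Streamz pipeline (the values and the windowed sum are just placeholders):

from streamz import Stream

# Each emitted value flows through the pipeline as it arrives;
# only a small sliding window is held at any time.
source = Stream()
source.map(float).sliding_window(3).map(sum).sink(print)

for value in ['1.0', '2.0', '3.0', '4.0']:
    source.emit(value)
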
   i. Use a distributed framework on a single machine
      i. Single machine: dask.distributed
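
A minimal dask.distributed sketch (the glob pattern and column names are made up):

from dask.distributed import Client
import dask.dataframe as dd

# Client() with no arguments starts a local "cluster" of worker
# processes on this one machine.
client = Client()

# The dataframe is split into partitions that are processed a piece at a time.
df = dd.read_csv('big-*.csv')
print(df.groupby('user_id')['amount'].mean().compute())
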
   j. Just load your data into a database like PostgreSQL.
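
A sketch using SQLAlchemy (the connection string, table, and query are illustrative):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/mydb')

# Load in chunks so the CSV never has to fit in memory...
for chunk in pd.read_csv('big.csv', chunksize=100_000):
    chunk.to_sql('events', engine, if_exists='append', index=False)

# ...then let the database do the heavy lifting and bring back only the aggregate.
summary = pd.read_sql('SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id', engine)
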
   k. Just use the disk.
      i. It will be slow, but perhaps acceptable until you can implement a better solution.
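
For example, a NumPy memory-mapped array lives on disk and loads only the pieces you touch:

import numpy as np

arr = np.memmap('big.dat', dtype=np.float32, mode='w+', shape=(1_000_000, 100))
arr[0, :] = 1.0          # writes go to the file on disk
print(arr[:10].mean())   # only the touched rows are read back
arr.flush()
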
   l. Money solution
      i. Buy or rent computers with large amounts of RAM.
2. Big Data: anything that is not "small big data"
   a. Dask, Spark, ...
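
The same read-and-aggregate pattern in PySpark, which can spread the work across a cluster once one machine is no longer enough (file and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('big-data').getOrCreate()
df = spark.read.csv('big.csv', header=True, inferSchema=True)
df.groupBy('user_id').count().show()
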


References


1. NumPy Reference
   a. Array Objects
      i. Data Type Objects (dtype)
2. Pandas
   a. Getting Started
      i. Essential Basic Functionality
         A. dtypes
3. Python Documentation
   a. Language Reference
      i. Expressions
         A. Atoms
            I. Generator Expressions