Wednesday, February 12, 2020

How do you handle the situation when your data does not fit into memory?


This won't be a conventional blog post. It won't have paragraph after paragraph. Instead, it will be a list around which people can have conversations. The links within the list and the references at the end will provide the details.


The majority of the list items below come from the Itamar Turner-Trauring video mentioned earlier.


1. Small Big Data: too big to fit in memory, but too small for a complex big data cluster
   a. Load only what you need
      i. Do you really need to load all the columns?
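
For example, a minimal pandas sketch (the file name and column names are made up):

import pandas as pd

# Read only the two columns we actually need; pandas never
# materializes the rest of the file's columns.
df = pd.read_csv('big.csv', usecols=['user_id', 'amount'])
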
   b. Use the appropriate data type
      i. If possible, use NumPy's 16-bit unsigned integers rather than the default 64-bit integers.
      ii. If possible, use the Pandas category dtype rather than the object dtype.
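
A small illustration of both ideas (the values and column are made up):

import numpy as np
import pandas as pd

# 64-bit integers by default...
counts = np.arange(50_000, dtype=np.int64)
# ...versus 16-bit unsigned integers: one quarter of the memory,
# as long as every value fits in 0..65535.
counts_small = counts.astype(np.uint16)
print(counts.nbytes, counts_small.nbytes)

# A repetitive string column stored as category instead of object.
df = pd.DataFrame({'city': ['Boston', 'Chicago'] * 500_000})
df['city'] = df['city'].astype('category')
print(df.memory_usage(deep=True))
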
   c. Divide the data into chunks and process one chunk at a time
      i. pandas.read_csv(..., chunksize=<rows per chunk>) returns an iterator of DataFrames instead of one big DataFrame
      ii. Zarr: chunked, compressed, N-dimensional arrays
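
For example (again with made-up file and column names):

import pandas as pd

# Each chunk is an ordinary DataFrame of at most 100,000 rows,
# so only one chunk is in memory at a time.
total = 0
for chunk in pd.read_csv('big.csv', chunksize=100_000):
    total += chunk['amount'].sum()
print(total)
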
   d. Use generator expressions: lazily generated iterable objects
      i. Python's Generator Expressions: Fitting Large Datasets into Memory by Luciano Strika
         A. Commonly used example:

with open('big.csv') as f:
    # Iterating over the file object yields one line at a time,
    # so the whole file is never read into memory.
    for line in f:
        process(line)
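
The same idea written as a generator expression (assuming each line holds a single number):

with open('big.csv') as f:
    # The parentheses build a generator, not a list, so only one
    # line is materialized at a time.
    total = sum(float(line) for line in f)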

   e. Indexing: when you only need a specific subset of the data
      i. Populate SQLite from Pandas
         A. Load from SQLite into a DataFrame
      ii. The vroom R package
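
A rough sketch of the SQLite round trip (the file, table, and column names are illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect('big.sqlite')

# One-time step: load the CSV into SQLite in chunks and index the lookup column.
for chunk in pd.read_csv('big.csv', chunksize=100_000):
    chunk.to_sql('events', conn, if_exists='append', index=False)
conn.execute('CREATE INDEX IF NOT EXISTS idx_user ON events (user_id)')

# Later: pull only the rows you actually need into a DataFrame.
subset = pd.read_sql('SELECT * FROM events WHERE user_id = ?', conn, params=(42,))
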
   f. If your data is sparse, take advantage of it.
      i. Pandas - Sparse Data Structures
      ii. sparse.pydata.org
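
For instance, a mostly-zero column stored with the pandas sparse dtype:

import numpy as np
import pandas as pd

dense = pd.Series(np.zeros(1_000_000))
dense.iloc[::10_000] = 1.0   # only 100 non-zero values

# Store only the non-fill values plus their positions.
sparse = dense.astype(pd.SparseDtype('float', fill_value=0.0))
print(dense.memory_usage(deep=True), sparse.memory_usage(deep=True))
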
   g. Approaches involving approximations
      i. Requires application of domain knowledge
      ii. Just sample your data.
      iii. Probabilistic data structures
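
One way to sample while reading, so the full file never has to fit in memory (the 1% rate is arbitrary):

import random
import pandas as pd

def skip_most_rows(row_number):
    # Never skip the header (row 0); each data row survives with probability ~1%.
    return row_number > 0 and random.random() > 0.01

sample = pd.read_csv('big.csv', skiprows=skip_most_rows)
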
   h. Streaming
      i. Streamz
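
A toy Streamz pipeline (the values and the windowed sum are just placeholders):

from streamz import Stream

# Each emitted value flows through the pipeline as it arrives;
# only a small sliding window is held at any time.
source = Stream()
source.map(float).sliding_window(3).map(sum).sink(print)

for value in ['1.0', '2.0', '3.0', '4.0']:
    source.emit(value)
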
   i. Use a distributed framework on a single machine
      i. Single machine: dask.distributed
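
A minimal dask.distributed sketch (the glob pattern and column names are made up):

from dask.distributed import Client
import dask.dataframe as dd

# Client() with no arguments starts a local "cluster" of worker
# processes on this one machine.
client = Client()

# The dataframe is split into partitions that are processed a piece at a time.
df = dd.read_csv('big-*.csv')
print(df.groupby('user_id')['amount'].mean().compute())
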
   j. Just load your data into a database like PostgreSQL.
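
A sketch using SQLAlchemy (the connection string, table, and query are illustrative):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/mydb')

# Load in chunks so the CSV never has to fit in memory...
for chunk in pd.read_csv('big.csv', chunksize=100_000):
    chunk.to_sql('events', engine, if_exists='append', index=False)

# ...then let the database do the heavy lifting and bring back only the aggregate.
summary = pd.read_sql('SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id', engine)
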
   k. Just use the disk.
      i. It will be slow, but perhaps acceptable until you can implement a better solution.
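
For example, a NumPy memory-mapped array lives on disk and loads only the pieces you touch:

import numpy as np

arr = np.memmap('big.dat', dtype=np.float32, mode='w+', shape=(1_000_000, 100))
arr[0, :] = 1.0          # writes go to the file on disk
print(arr[:10].mean())   # only the touched rows are read back
arr.flush()
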
   l. Money solution
      i. Buy or rent computers with large amounts of RAM.
2. Big Data: anything that is not "small big data"
   a. Dask, Spark, ...
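
The same read-and-aggregate pattern in PySpark, which can spread the work across a cluster once one machine is no longer enough (file and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('big-data').getOrCreate()
df = spark.read.csv('big.csv', header=True, inferSchema=True)
df.groupBy('user_id').count().show()
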


References


1. NumPy Reference
   a. Array Objects
      i. Data Type Objects (dtype)
2. Pandas
   a. Getting Started
      i. Essential Basic Functionality
         A. dtypes
3. Python Documentation
   a. Language Reference
      i. Expressions
         A. Atoms
            I. Generator Expressions