If you don't have much time to devote to the topic, just watch the video Small Big Data: using NumPy and Pandas when your data doesn't fit in memory by Itamar Turner-Trauring at PyData New York 2019.
This won't be a conventional blog post. It won't have paragraph after paragraph. Instead, it will be a list around which people can have conversations. The links within the list and the references at the end will provide the details. The majority of the list items below come from Itamar Turner-Trauring's video mentioned above.
1. Small Big Data: too big to fit in memory but too small for a complex big data cluster
a. Load only what you need
i. Do you really need to load all the columns?
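For example, a minimal sketch (the column names id and amount are hypothetical):

    import pandas as pd

    # Read only the columns that are actually needed instead of the whole file.
    df = pd.read_csv('big.csv', usecols=['id', 'amount'])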
b. Use the appropriate data type
i. If possible, use NumPy 16-bit unsigned integers rather than the default 64-bit integers.
ii. If possible, use the Pandas category data type rather than object.
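A sketch of both ideas (the example values are made up; the savings depend on your data):

    import numpy as np
    import pandas as pd

    # 16-bit unsigned integers need a quarter of the memory of the default 64-bit integers.
    small_ints = np.arange(60_000, dtype=np.uint16)

    # A low-cardinality string column stored as category instead of object.
    colors = pd.Series(['red', 'green', 'blue'] * 200_000)
    as_category = colors.astype('category')
    print(colors.memory_usage(deep=True), as_category.memory_usage(deep=True))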
c. Divide data into chunks: process one chunk at a time
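A minimal sketch using pandas chunked reading (the chunk size and the column name amount are arbitrary):

    import pandas as pd

    total = 0
    # read_csv with chunksize yields one DataFrame per chunk instead of loading everything at once.
    for chunk in pd.read_csv('big.csv', chunksize=100_000):
        total += chunk['amount'].sum()
    print(total)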
d. Use generator expressions: lazily generated iterable objects
A. Commonly used example (process is a placeholder for whatever per-line work you need)

    with open('big.csv') as f:
        for line in f:        # the file object yields one line at a time, lazily
            process(line)     # placeholder for your per-line processing
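A true generator expression version of the same idea, as a sketch (it totals line lengths without building a list in memory):

    with open('big.csv') as f:
        # The parentheses create a lazy generator, not an in-memory list.
        total_bytes = sum(len(line) for line in f)
    print(total_bytes)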
e. Indexing: only need a specific subset of the data
i. Populate SQLite from Pandas
A. Load from SQLite into a DataFrame
ii. vroom R package
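A sketch of the SQLite round trip (the table name events and the column user_id are hypothetical):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect('big.db')

    # Populate SQLite from pandas, one chunk at a time.
    for chunk in pd.read_csv('big.csv', chunksize=100_000):
        chunk.to_sql('events', conn, if_exists='append', index=False)

    # Later, pull back only the subset you actually need.
    subset = pd.read_sql('SELECT * FROM events WHERE user_id = 42', conn)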
f. If your data is sparse, take advantage of it.
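A sketch using the pandas sparse dtype (assumes a column that is mostly zeros):

    import numpy as np
    import pandas as pd

    # A column that is mostly zeros wastes memory in its dense form.
    dense = pd.Series(np.zeros(1_000_000))
    dense.iloc[::1000] = 1.0

    # The sparse dtype stores only the non-zero values and their positions.
    sparse = dense.astype(pd.SparseDtype('float64', fill_value=0.0))
    print(dense.memory_usage(deep=True), sparse.memory_usage(deep=True))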
g. Approaches Involving Approximations
i. Requires application of domain knowledge
ii. Just sample your data.
iii. Probabilistic Data Structures
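A sketch of the sampling idea (the 1% rate is arbitrary, and this keeps every 100th row rather than a random sample):

    import pandas as pd

    # Keep roughly 1% of the rows while reading, so the full file never sits in memory.
    sample = pd.read_csv('big.csv', skiprows=lambda i: i > 0 and i % 100 != 0)
    print(len(sample))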
h. Streaming
i. Streamz
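A minimal Streamz sketch (assumes the streamz package is installed; the doubling step is only a placeholder for real per-element work):

    from streamz import Stream

    # Build a small pipeline: each emitted value is doubled and then printed.
    source = Stream()
    source.map(lambda x: x * 2).sink(print)

    for value in range(5):
        source.emit(value)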
i. Use a single-machine distributed framework (e.g., Dask in local mode)
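A sketch with Dask on one machine (assumes dask is installed; the file pattern and the column names user_id and amount are hypothetical):

    import dask.dataframe as dd

    # Dask reads the CSVs lazily in partitions and schedules the work across local cores.
    df = dd.read_csv('big-*.csv')
    result = df.groupby('user_id')['amount'].mean().compute()
    print(result.head())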
j. Just load your data into a database like PostgreSQL.
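A sketch using SQLAlchemy and pandas (assumes SQLAlchemy and a PostgreSQL driver such as psycopg2 are installed; the connection string and table name are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('postgresql://user:password@localhost:5432/mydb')

    # Push the data into PostgreSQL chunk by chunk and let the database do the heavy lifting.
    for chunk in pd.read_csv('big.csv', chunksize=100_000):
        chunk.to_sql('events', engine, if_exists='append', index=False)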
k. Just use the disk.
i. It will be slow but perhaps acceptable until you can implement a better solution.
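A sketch using a NumPy memory-mapped array (the file name, shape, and dtype are made up; the array lives on disk and only the slices you touch are paged into memory):

    import numpy as np

    # Create a disk-backed array instead of allocating it all in RAM.
    data = np.memmap('big.dat', dtype='float32', mode='w+', shape=(1_000_000, 10))
    data[:100] = 1.0      # work on a small window at a time
    data.flush()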
l. Money Solution
i. Buy or rent computers with large amounts of RAM.
2. Big Data: anything that is not "small big data"
a. Dask, Spark, ...
References
1. NumPy Reference
a. Array Objects
2. Pandas
a. Getting Started
i. Essential Basic Functionality
A. dtypes
3. Python Documentation
a. Language Reference
i. Expressions
A. Atoms