Generating synthetic data has been a topic of interest since the early days of conventional databases. Recently it has become a hot topic because of the large language model (LLM) phenomenon: essentially everyone has already scraped the internet, and now more data, and higher-quality data, is needed. For further details, see the video "Jonathan Ross, Founder & CEO @ Groq: NVIDIA vs Groq - The Future of Training vs Inference | E1260" on 20VC with Harry Stebbings; the relevant discussion starts about 2 minutes in and lasts roughly 3 minutes.
This blog post contains a list of references which will hopefully save others some legwork.
Articles
Synthetic Data by Robert Riemann at the European Data Protection Supervisor
Books
A Hands-On Guide to Machine Learning with R by Norman Matloff
  Part V: Applications
    Chapter V.12: Image Classification
      Tricks of the Trade
        Data Augmentation
Designing Machine Learning Systems by Chip Huyen
  Chapter 4: Training Data
    Data Augmentation
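Both books point to data augmentation as the most common, lightweight form of synthetic data: you manufacture extra training examples by perturbing the examples you already have. As a rough illustration (my own sketch, not code from either book; the file name and parameter values are made up), an image-augmentation pipeline with torchvision might look like this:

```python
# Illustrative image-augmentation pipeline (a sketch, not taken from either book).
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop, then resize to 224x224
    transforms.RandomHorizontalFlip(p=0.5),                 # mirror the image half the time
    transforms.RandomRotation(degrees=10),                  # small random rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # mild photometric perturbation
])

img = Image.open("cat.jpg")                   # hypothetical input image
variants = [augment(img) for _ in range(4)]   # four synthetic variants of one photo
```

Each call to augment(img) produces a slightly different image that keeps the original label, so a small labelled dataset can be stretched considerably without collecting anything new.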