Generating synthetic data has been a topic of interest since the early days of conventional databases. Recently it has become a hot topic because of the large language model (LLM) phenomenon: essentially everyone has already scraped the internet, and now more data, and higher-quality data, is needed. For further details, see the video "Jonathan Ross, Founder & CEO @ Groq: NVIDIA vs Groq - The Future of Training vs Inference | E1260" on 20VC with Harry Stebbings; the relevant discussion starts about 2 minutes in and lasts roughly 3 minutes.
This blog post contains a list of references which will hopefully save others some legwork.
Articles
Synthetic Data by Robert Riemann at the European Data Protection Supervisor
Books
A Hands-On Guide to Machine Learning with R by Norman Matloff
  Part V: Applications
    Chapter V.12: Image Classification
      Tricks of the Trade
        Data Augmentation
Designing Machine Learning Systems by Chip Huyen
  Chapter 4: Training Data
    Data Augmentation
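Both books point to data augmentation as the most common, lightweight form of synthetic data: you manufacture extra training examples by perturbing the examples you already have. As a rough illustration (my own sketch, not code from either book; the file name and parameter values are made up), an image-augmentation pipeline with torchvision might look like this:

```python
# Illustrative image-augmentation pipeline (a sketch, not taken from either book).
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop, then resize to 224x224
    transforms.RandomHorizontalFlip(p=0.5),                 # mirror the image half the time
    transforms.RandomRotation(degrees=10),                  # small random rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # mild photometric perturbation
])

img = Image.open("cat.jpg")                   # hypothetical input image
variants = [augment(img) for _ in range(4)]   # four synthetic variants of one photo
```

Each call to augment(img) produces a slightly different image that keeps the original label, so a small labelled dataset can be stretched considerably without collecting anything new.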