Top 23 Python synthetic-data Projects

Mimesis

1 4 4,612 6.6 Python

Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.

Project mention: Mimesis: The Fake Data Generator That Will Blow Your Mind! | dev.to | 2025-05-08
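
To give a sense of what locale-aware fake data generation involves, here is a minimal standard-library sketch. It is not Mimesis's API: the `NAMES` table, the `fake_person` function, and the locale keys are hypothetical names chosen for illustration, and a real generator ships far larger per-locale datasets.

```python
import random

# Hypothetical per-locale name pools (illustration only, not Mimesis data).
NAMES = {
    "en": {"first": ["Alice", "Brian", "Carol"], "last": ["Smith", "Jones", "Taylor"]},
    "de": {"first": ["Anna", "Lukas", "Marie"], "last": ["Schmidt", "Weber", "Fischer"]},
}

def fake_person(locale: str, rng: random.Random) -> dict:
    """Return one fake person record drawn from the given locale's pools."""
    pool = NAMES[locale]
    first = rng.choice(pool["first"])
    last = rng.choice(pool["last"])
    return {
        "name": f"{first} {last}",
        "email": f"{first}.{last}@example.com".lower(),
        "age": rng.randint(18, 90),
    }

rng = random.Random(42)  # fixed seed so repeated runs give the same records
people = [fake_person("de", rng) for _ in range(3)]
```

Passing an explicit `random.Random` instance keeps the output reproducible, which is usually what you want when fake data feeds tests or fixtures.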

Kiln

2 14 4,084 9.9 Python

The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.

Project mention: Show HN: Kiln – AI Boilerplate with Evals, Fine-Tuning, Synthetic Data, and Git | news.ycombinator.com | 2025-07-28

I noticed there weren't boilerplates for AI projects like there were for web apps, so I built one. Same idea - everything you need to get a project up and running quickly. However, instead of web-framework/CSS/DB, it's tools for AI projects: evals, synthetic data gen, fine-tuning, and more.
Kiln is a free, open tool that gives you everything most AI projects need in one integrated package:
- Eval system: including LLM-as-judge evals, eval data generation, human baselines
- Fine-tuning: proxy to many fine-tuning providers like Fireworks/Together/OpenAI/Unsloth
- Synthetic data generation: deeply integrated into evals and fine-tuning
- Model routing: 12 providers including Ollama, OpenRouter, and more
- Git-based collaboration: projects are designed to be synced through your own git server
The key insight is that these tools work much better when they're integrated. For example, the synthetic data generator knows whether you're creating data for evals vs. fine-tuning (which have very different data needs), and evals can automatically test different prompt/model/fine-tune combinations.
It runs entirely locally - your project data stays in local files, and you control your own git repos. No external services required (though it integrates with them if you want).
Main project GitHub: https://github.com/Kiln-AI/Kiln
Demo GitHub where I use it to build a 'natural language to ffmpeg command' demo with evals, fine-tunes, and synthetic data (including demo video): https://github.com/Kiln-AI/demos/blob/main/end_to_end_projec...
BlenderProc

3 15 3,191 7.3 Python

A procedural Blender pipeline for photorealistic training image generation
SDV

4 59 3,139 9.5 Python

Synthetic data generation for tabular data
distilabel

5 2 2,862 7.8 Python

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

Project mention: Distilabel is a framework for synthetic data and AI feedback | news.ycombinator.com | 2025-01-28
curator

6 5 1,487 9.9 Python

Synthetic data curation for post-training and structured data extraction (by bespokelabsai)

Project mention: Ask HN: Is synthetic data generation practical outside academia? | news.ycombinator.com | 2025-06-06

https://github.com/bespokelabsai/curator
But it still feels very research-oriented. I haven’t found many examples of these pipelines running in real-world products.
I’m curious:
1. Who is using synthetic-data pipelines in production today?
2. What tasks does it actually improve? E.g., fine-tuning smaller models for specific tasks?
Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!
CTGAN

7 2 1,443 7.8 Python

Conditional GAN for generating synthetic tabular data.
intellagent

8 2 1,117 9.3 Python

A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions

Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01

When it comes to evaluating conversational agents, there are some smart ways to do it. Take a framework like IntellAgent. It uses AI to test other AI! It's a three-step process designed to make testing more thorough and realistic than just having a person manually try things out.
DataDreamer

9 6 1,051 7.1 Python

DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤

Project mention: Show HN: I built an AI dataset generator | news.ycombinator.com | 2025-06-26

This is sometimes called distillation. Here is a robust example from some UPenn students: https://datadreamer.dev/
bonito

10 1 788 4.4 Python

A lightweight library for generating synthetic instruction tuning datasets for your data without GPT. (by BatsResearch)
pygraft

11 1 690 6.0 Python

Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips
gretel-synthetics

12 4 656 6.6 Python

Synthetic data generators for structured and unstructured text, featuring differentially private learning.
mostlyai

13 1 622 9.7 Python

Synthetic Data SDK ✨

Project mention: Open-Source Synthetic Data SDK | news.ycombinator.com | 2025-02-22

MOSTLY AI has open-sourced its powerful Synthetic Data SDK, enabling you to create privacy-preserving, AI-generated synthetic data directly from your existing datasets—all within your secure environments.
Key Features:
- Broad Data Support: Handle mixed data types (categorical, numerical, geospatial, text), single/multi-table datasets & time-series data.
- Multiple Model Types: Leverage TabularARGN (SOTA for tabular data), fine-tuned HuggingFace models, and efficient LSTM for text generation.
- Advanced Training Options: CPU/GPU support, differential privacy, and real-time progress monitoring.
- Automated Quality Assurance: Built-in fidelity & privacy metrics with detailed HTML reports for visual data analysis.
- Flexible Sampling: Upsample data, generate conditionally, rebalance segments, impute context-aware values, ensure fairness, and control outputs via temperature adjustments.
- Seamless Integration: Connect effortlessly to external databases & cloud storage with a fully permissive open-source license.
Check out the SDK on GitHub: https://github.com/mostly-ai/mostlyai
Copulas

14 1 607 8.0 Python

A library to model multivariate data using copulas.
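
As a rough illustration of the idea behind copula-based modeling, the sketch below fits a two-column Gaussian copula using only the standard library. It is not the Copulas library's API: every function name here (`pearson`, `to_normal_scores`, `empirical_quantile`, `fit_and_sample`) and the toy height/weight data are assumptions made for the example. The approach is to map each column to normal scores through its ranks, estimate the correlation of those scores, sample correlated normals, and map them back through each column's empirical quantiles.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

def to_normal_scores(column):
    """Map values to standard-normal scores via their empirical ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1), so inv_cdf is defined
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def empirical_quantile(sorted_column, u):
    """Invert the empirical CDF: return the observed value at quantile u."""
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]

def fit_and_sample(x, y, n_samples, rng):
    """Fit a two-column Gaussian copula and draw synthetic (x, y) pairs."""
    rho = pearson(to_normal_scores(x), to_normal_scores(y))
    residual_scale = max(0.0, 1.0 - rho * rho) ** 0.5  # guard against rounding
    xs, ys = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual_scale * rng.gauss(0, 1)  # corr(z1, z2) = rho
        samples.append((
            empirical_quantile(xs, STD_NORMAL.cdf(z1)),
            empirical_quantile(ys, STD_NORMAL.cdf(z2)),
        ))
    return samples

# Toy "real" data: two positively correlated columns (made up for the example).
rng = random.Random(0)
heights = [rng.gauss(170, 10) for _ in range(500)]
weights = [0.9 * h - 85 + rng.gauss(0, 5) for h in heights]
synthetic = fit_and_sample(heights, weights, 500, rng)
```

Because the marginals are resampled from observed values, each synthetic column keeps the original column's range, while the dependence between columns comes only from the fitted correlation.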
synthcity

15 4 582 6.4 Python

A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.
augraphy

16 1 451 5.0 Python

Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes

Project mention: All Data and AI Weekly #182 - 24-March-2025 | dev.to | 2025-03-24

⚡️ https://github.com/sparkfish/augraphy
Robotics-Object-Pose-Estimation

17 2 325 0.0 Python

A complete end-to-end demonstration in which we collect training data in Unity and use that data to train a deep neural network to predict the pose of a cube. This model is then deployed in a simulated robotic pick-and-place task.
zpy

18 9 310 0.0 Python

Synthetic data for computer vision. An open source toolkit using Blender and Python.
DoppelGANger

19 14 306 3.8 Python

[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
SDGym

20 1 277 8.3 Python

Benchmarking synthetic data generation methods.
edsl

21 3 270 10.0 Python

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.
SDMetrics

22 2 246 8.8 Python

Metrics to evaluate quality and efficacy of synthetic datasets.
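
For a feel of what a column-level quality metric can look like, here is a minimal standard-library sketch based on the two-sample Kolmogorov-Smirnov statistic. It is not SDMetrics's API: `ks_statistic` and `similarity_score` are hypothetical names, and the score convention (1 minus the KS statistic) is one reasonable choice assumed for this example.

```python
from bisect import bisect_right

def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    r, s = sorted(real), sorted(synthetic)
    gap = 0.0
    for v in r + s:  # the largest gap is always attained at an observed value
        cdf_r = bisect_right(r, v) / len(r)
        cdf_s = bisect_right(s, v) / len(s)
        gap = max(gap, abs(cdf_r - cdf_s))
    return gap

def similarity_score(real, synthetic):
    """Return 1 - KS statistic; 1.0 means identical empirical distributions."""
    return 1.0 - ks_statistic(real, synthetic)

real_column = [1, 2, 3, 4]
synthetic_column = [3, 4, 5, 6]
score = similarity_score(real_column, synthetic_column)
```

A score near 1.0 suggests the synthetic column's distribution closely tracks the real one; a score near 0.0 means the two barely overlap. A check like this looks at one column at a time and says nothing about relationships between columns.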
AgML

23 1 233 9.1 Python

AgML is a centralized framework for agricultural machine learning. AgML provides access to public agricultural datasets for common agricultural deep learning tasks, with standard benchmarks and pretrained models, as well as the ability to generate synthetic data and annotations.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python synthetic-data related posts

Show HN: I built an AI dataset generator

5 projects | news.ycombinator.com | 26 Jun 2025
Ask HN: Is synthetic data generation practical outside academia?

2 projects | news.ycombinator.com | 6 Jun 2025
Hugging Face is looking for reasoning datasets beyond math, science and coding

2 projects | dev.to | 16 Apr 2025
1000 stars on GitHub feels like a Million likes on any other platform

1 project | dev.to | 21 Mar 2025
Gemini 50% cheaper with Batch API in Curator

1 project | dev.to | 14 Mar 2025
Launch HN: Trellis (YC W24) – AI-powered workflows for unstructured data

3 projects | news.ycombinator.com | 13 Aug 2024
SDMetrics: Library for evaluating synthetic data quality

1 project | news.ycombinator.com | 12 Apr 2024

Index

What are some of the best open-source synthetic-data projects in Python? This list will help you:

#	Project	Stars
1	Mimesis	4,612
2	Kiln	4,084
3	BlenderProc	3,191
4	SDV	3,139
5	distilabel	2,862
6	curator	1,487
7	CTGAN	1,443
8	intellagent	1,117
9	DataDreamer	1,051
10	bonito	788
11	pygraft	690
12	gretel-synthetics	656
13	mostlyai	622
14	Copulas	607
15	synthcity	582
16	augraphy	451
17	Robotics-Object-Pose-Estimation	325
18	zpy	310
19	DoppelGANger	306
20	SDGym	277
21	edsl	270
22	SDMetrics	246
23	AgML	233
