Top 23 Python synthetic-data Projects
-
Mimesis
Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.
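To give a sense of what locale-aware fake data generation involves, here is a minimal standard-library sketch. It does not use Mimesis's API: the `NAMES` table and the `fake_person` helper are hypothetical names invented for illustration, and a real generator ships far larger per-locale datasets.

```python
import random

# Hypothetical, tiny per-locale name pools (illustrative only).
NAMES = {
    "en": {"first": ["Alice", "Bob", "Carol"], "last": ["Smith", "Jones", "Brown"]},
    "de": {"first": ["Anna", "Lukas", "Marie"], "last": ["Müller", "Schmidt", "Weber"]},
}


def fake_person(locale: str, rng: random.Random) -> dict:
    """Return one fake person record drawn from the given locale's pools."""
    pool = NAMES[locale]
    first = rng.choice(pool["first"])
    last = rng.choice(pool["last"])
    return {
        "locale": locale,
        "full_name": f"{first} {last}",
        "age": rng.randint(18, 90),
    }


# Seeding the generator makes the fake data reproducible across runs.
rng = random.Random(42)
people = [fake_person(loc, rng) for loc in ("en", "de", "en")]
for person in people:
    print(person)
```

Passing an explicit `random.Random` instance, rather than relying on the module-level generator, keeps test fixtures deterministic without affecting other code that uses `random`.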
-
Kiln
The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.
Project mention: Show HN: Kiln – AI Boilerplate with Evals, Fine-Tuning, Synthetic Data, and Git | news.ycombinator.com | 2025-07-28
I noticed there weren't boilerplates for AI projects like there were for web apps, so I built one. Same idea - everything you need to get a project up and running quickly. However, instead of web-framework/CSS/DB, it's tools for AI projects: evals, synthetic data gen, fine-tuning, and more.
Kiln is a free, open tool that gives you everything most AI projects need in one integrated package:
- Eval system: including LLM-as-judge evals, eval data generation, human baselines
- Fine-tuning: proxy to many fine-tuning providers like Fireworks/Together/OpenAI/Unsloth
- Synthetic data generation: deeply integrated into evals and fine-tuning
- Model routing: 12 providers including Ollama, OpenRouter, and more
- Git-based collaboration: projects are designed to be synced through your own git server
The key insight is that these tools work much better when they're integrated. For example, the synthetic data generator knows whether you're creating data for evals vs. fine-tuning (which have very different data needs), and evals can automatically test different prompt/model/fine-tune combinations.
It runs entirely locally - your project data stays in local files, and you control your own git repos. No external services required (though it integrates with them if you want).
Main project GitHub: https://github.com/Kiln-AI/Kiln
Demo GitHub where I use it to build a 'natural language to ffmpeg command' demo with evals, fine-tunes, and synthetic data (including demo video): https://github.com/Kiln-AI/demos/blob/main/end_to_end_projec...
-
distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
Project mention: Distilabel is a framework for synthetic data and AI feedback | news.ycombinator.com | 2025-01-28
-
curator
Project mention: Ask HN: Is synthetic data generation practical outside academia? | news.ycombinator.com | 2025-06-06
https://github.com/bespokelabsai/curator
But it still feels very research-oriented. I haven’t found many examples of these pipelines running in real-world products.
I’m curious:
1. Who is using synthetic-data pipelines in production today?
2. What tasks does it actually improve? E.g., fine-tuning smaller models for specific tasks?
Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!
-
intellagent
A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions
Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01
When it comes to evaluating conversational agents, there are some smart ways to do it. Take a framework like IntellAgent. It uses AI to test other AI! It's a three-step process designed to make testing more thorough and realistic than just having a person manually try things out.
-
DataDreamer
This is sometimes called distillation. Here is a robust example from some UPenn students: https://datadreamer.dev/
-
bonito
A lightweight library for generating synthetic instruction tuning datasets for your data without GPT. (by BatsResearch)
-
gretel-synthetics
Synthetic data generators for structured and unstructured text, featuring differentially private learning.
-
mostlyai
MOSTLY AI has open-sourced its Synthetic Data SDK, which lets you create privacy-preserving, AI-generated synthetic data directly from your existing datasets, all within your own secure environments.
Key Features:
- Broad Data Support: Handle mixed data types (categorical, numerical, geospatial, text), single/multi-table datasets & time-series data.
- Multiple Model Types: Leverage TabularARGN (SOTA for tabular data), fine-tuned HuggingFace models, and efficient LSTM for text generation.
- Advanced Training Options: CPU/GPU support, differential privacy, and real-time progress monitoring.
- Automated Quality Assurance: Built-in fidelity & privacy metrics with detailed HTML reports for visual data analysis.
- Flexible Sampling: Upsample data, generate conditionally, rebalance segments, impute context-aware values, ensure fairness, and control outputs via temperature adjustments.
- Seamless Integration: Connect effortlessly to external databases & cloud storage with a fully permissive open-source license.
Check out the SDK on GitHub: https://github.com/mostly-ai/mostlyai
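The "temperature adjustments" item above refers to a general sampling technique. As a minimal sketch (not the SDK's API; `apply_temperature`, `sample_with_temperature`, and the example probabilities are assumptions made for illustration), temperature rescales a categorical distribution before sampling: values below 1 concentrate probability on the likeliest categories, and values above 1 flatten the distribution.

```python
import math
import random


def apply_temperature(probs: list[float], temperature: float) -> list[float]:
    """Rescale a categorical distribution to p_i^(1/T), renormalized."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if sum(probs) <= 0:
        raise ValueError("probs must contain some positive mass")
    # Work in log space; zero-probability categories stay at zero.
    logits = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    top = max(logits)
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_with_temperature(categories, probs, temperature, rng):
    """Draw one category after temperature rescaling."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(categories, weights=scaled, k=1)[0]


probs = [0.7, 0.2, 0.1]
print(apply_temperature(probs, 1.0))  # T = 1 leaves the distribution as-is
print(apply_temperature(probs, 0.5))  # T < 1 sharpens it
print(apply_temperature(probs, 2.0))  # T > 1 flattens it
print(sample_with_temperature(["a", "b", "c"], probs, 0.5, random.Random(0)))
```

In a synthetic-data setting, lower temperatures tend to yield more typical records, and higher temperatures tend to yield more diverse, rarer ones.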
-
synthcity
A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.
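One simple way to evaluate the fidelity of synthetic tabular data is to compare a column's category frequencies in the real and synthetic tables. The sketch below computes the total variation distance using only the standard library. It is an illustrative metric, not synthcity's API, and the function name and toy columns are assumptions.

```python
from collections import Counter


def total_variation_distance(real: list, synthetic: list) -> float:
    """Return the TVD in [0, 1] between two columns' category frequencies."""
    if not real or not synthetic:
        raise ValueError("both columns must be non-empty")
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    # Counter returns 0 for categories missing from one side.
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )


real_col = ["a", "a", "b", "c"]
synth_col = ["a", "b", "b", "c"]
# 0 means identical frequencies; 1 means no shared categories.
print(total_variation_distance(real_col, synth_col))
```

A per-column score like this ignores relationships between columns, so it is best read as a first sanity check rather than a full quality assessment.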
-
augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
⚡️ https://github.com/sparkfish/augraphy
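The idea of an augmentation pipeline can be sketched with plain Python: each step takes an image and returns a degraded copy, and the pipeline applies the steps in order. This is not Augraphy's API. The `fade_ink`, `speckle`, and `run_pipeline` names are hypothetical, and the image is a toy grid of grayscale values rather than a real scan.

```python
import random
from typing import Callable

Image = list[list[int]]  # grayscale pixels, 0 (black) to 255 (white)
Augmentation = Callable[[Image, random.Random], Image]


def fade_ink(image: Image, rng: random.Random) -> Image:
    """Lighten every pixel halfway toward white, like a low-toner print."""
    return [[p + (255 - p) // 2 for p in row] for row in image]


def speckle(image: Image, rng: random.Random, rate: float = 0.1) -> Image:
    """Turn a random fraction of pixels black, like dust on a scanner."""
    return [[0 if rng.random() < rate else p for p in row] for row in image]


def run_pipeline(image: Image, steps: list[Augmentation], seed: int = 0) -> Image:
    """Apply each augmentation in order, sharing one seeded generator."""
    rng = random.Random(seed)
    for step in steps:
        image = step(image, rng)
    return image


page = [[0, 255, 0], [255, 0, 255]]
degraded = run_pipeline(page, [fade_ink, speckle], seed=1)
print(degraded)
```

Because every step has the same signature, steps can be reordered or dropped without changing the pipeline code, and a fixed seed makes a degraded dataset reproducible.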
-
Robotics-Object-Pose-Estimation
A complete end-to-end demonstration in which we collect training data in Unity and use that data to train a deep neural network to predict the pose of a cube. This model is then deployed in a simulated robotic pick-and-place task.
-
DoppelGANger
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
-
edsl
Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.
-
AgML
AgML is a centralized framework for agricultural machine learning. AgML provides access to public agricultural datasets for common agricultural deep learning tasks, with standard benchmarks and pretrained models, as well as the ability to generate synthetic data and annotations.
Python synthetic-data related posts
-
Show HN: I built an AI dataset generator
-
Ask HN: Is synthetic data generation practical outside academia?
-
Hugging Face is looking for reasoning datasets beyond math, science and coding
-
1000 stars on GitHub feels like a Million likes on any other platform
-
Gemini 50% cheaper with Batch API in Curator
-
Launch HN: Trellis (YC W24) – AI-powered workflows for unstructured data
-
SDMetrics: Library for evaluating synthetic data quality
-
A note from our sponsor - Civic Auth
www.civic.com | 31 Aug 2025
Index
What are some of the best open-source synthetic-data projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Mimesis | 4,612 |
2 | Kiln | 4,084 |
3 | BlenderProc | 3,191 |
4 | SDV | 3,139 |
5 | distilabel | 2,862 |
6 | curator | 1,487 |
7 | CTGAN | 1,443 |
8 | intellagent | 1,117 |
9 | DataDreamer | 1,051 |
10 | bonito | 788 |
11 | pygraft | 690 |
12 | gretel-synthetics | 656 |
13 | mostlyai | 622 |
14 | Copulas | 607 |
15 | synthcity | 582 |
16 | augraphy | 451 |
17 | Robotics-Object-Pose-Estimation | 325 |
18 | zpy | 310 |
19 | DoppelGANger | 306 |
20 | SDGym | 277 |
21 | edsl | 270 |
22 | SDMetrics | 246 |
23 | AgML | 233 |