Top 23 Python synthetic-data Projects
-
Mimesis
Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.
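To illustrate the idea of seeded, locale-aware fake data, here is a minimal standard-library sketch. It does not use the Mimesis API; the name pools, `fake_person`, and `fake_people` are hypothetical and exist only for this example.

```python
import random

# Hypothetical per-locale name pools (assumption for illustration only).
# A real generator ships far larger datasets for many more locales.
NAME_POOLS = {
    "en": {"first": ["Alice", "James", "Olivia"], "last": ["Smith", "Brown", "Taylor"]},
    "de": {"first": ["Lukas", "Hanna", "Jonas"], "last": ["Müller", "Schmidt", "Weber"]},
}


def fake_person(locale: str, rng: random.Random) -> dict:
    """Build one fake record using the name pool of the given locale."""
    pool = NAME_POOLS[locale]
    first = rng.choice(pool["first"])
    last = rng.choice(pool["last"])
    return {"name": f"{first} {last}", "age": rng.randint(18, 90), "locale": locale}


def fake_people(locale: str, count: int, seed: int) -> list:
    """Generate `count` records; the same seed always yields the same records."""
    rng = random.Random(seed)
    return [fake_person(locale, rng) for _ in range(count)]


if __name__ == "__main__":
    for person in fake_people("de", 3, seed=42):
        print(person)
```

Passing a dedicated `random.Random` instance rather than using the module-level functions keeps the output reproducible, which is usually what you want for test fixtures.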
-
Kiln
The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.
Project mention: Show HN: Create your own finetuned AI model using Google Sheets | news.ycombinator.com | 2025-04-30
What’s the thinking of spreadsheet first? Just making it super accessible for people who already have data?
I’m building a UI for fine tuning (and evals, and synthetic data gen) - https://github.com/Kiln-AI/Kiln - and went the custom UI route. From chatting with folks - most people don’t have datasets, and need help building them.
-
distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
Project mention: Distilabel is a framework for synthetic data and AI feedback | news.ycombinator.com | 2025-01-28
-
curator
Project mention: Hugging Face is looking for reasoning datasets beyond math, science and coding | dev.to | 2025-04-16
The top 4 innovative uses of Curator each get a $250 Amazon (or country-specific equivalent) gift card.
-
intellagent
A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions
Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01
When it comes to evaluating conversational agents, there are some smart ways to do it. Take a framework like IntellAgent. It uses AI to test other AI! It's a three-step process designed to make testing more thorough and realistic than just having a person manually try things out.
-
bonito
A lightweight library for generating synthetic instruction tuning datasets for your data without GPT. (by BatsResearch)
-
gretel-synthetics
Synthetic data generators for structured and unstructured text, featuring differentially private learning.
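Differentially private learning is commonly implemented with DP-SGD: each example's gradient is clipped to a fixed norm, and Gaussian noise is added to the sum before averaging. The sketch below shows only that step in plain Python. It is not the gretel-synthetics API, and the function names and parameters (`max_norm`, `noise_multiplier`) are illustrative assumptions.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average.

    The noise standard deviation is noise_multiplier * max_norm, as in the
    usual DP-SGD formulation. Privacy accounting (epsilon, delta) is omitted.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip_gradient(grad, max_norm)):
            total[i] += g
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_example_grads) for n in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single record can move the model, and the noise masks what remains; a larger `noise_multiplier` gives stronger privacy at the cost of noisier updates.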
-
synthcity
A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.
-
mostlyai
MOSTLY AI has open-sourced its Synthetic Data SDK, which creates privacy-preserving, AI-generated synthetic data directly from your existing datasets, all within your own secure environment.
Key Features:
Broad Data Support: Handle mixed data types (categorical, numerical, geospatial, text), single/multi-table datasets & time-series data.
Multiple Model Types: Leverage TabularARGN (SOTA for tabular data), fine-tuned HuggingFace models, and efficient LSTM for text generation.
Advanced Training Options: CPU/GPU support, differential privacy, and real-time progress monitoring.
Automated Quality Assurance: Built-in fidelity & privacy metrics with detailed HTML reports for visual data analysis.
Flexible Sampling: Upsample data, generate conditionally, rebalance segments, impute context-aware values, ensure fairness, and control outputs via temperature adjustments.
Seamless Integration: Connect effortlessly to external databases & cloud storage with a fully permissive open-source license.
Check out the SDK on GitHub: https://github.com/mostly-ai/mostlyai
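The temperature control mentioned under Flexible Sampling can be pictured with a generic sketch. How the MOSTLY AI SDK implements it is not stated here; the functions below (`apply_temperature`, `sample_category`) are hypothetical and show only the standard technique of reshaping a categorical distribution before sampling.

```python
import random


def apply_temperature(probs, temperature):
    """Rescale probabilities as p ** (1 / T) and renormalize.

    T < 1 sharpens the distribution toward the most likely value,
    T > 1 flattens it, and T == 1 leaves it unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_category(categories, probs, temperature, rng):
    """Draw one category after applying the temperature adjustment."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(categories, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(7)
    categories = ["common", "rare"]
    probs = [0.8, 0.2]
    for t in (0.5, 1.0, 2.0):
        draws = [sample_category(categories, probs, t, rng) for _ in range(1000)]
        print(t, draws.count("rare"))
```

Lower temperatures yield more conservative synthetic records that stick to frequent values, while higher temperatures produce more diverse output that includes rare categories more often.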
-
augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
⚡️ https://github.com/sparkfish/augraphy
-
Robotics-Object-Pose-Estimation
A complete end-to-end demonstration in which we collect training data in Unity and use that data to train a deep neural network to predict the pose of a cube. This model is then deployed in a simulated robotic pick-and-place task.
-
DoppelGANger
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
-
edsl
Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.
Project mention: Python Library for Structured Data Extraction via LLM | news.ycombinator.com | 2024-08-14
Hey thanks for noticing - here's the MIT licensed library it's based on: https://github.com/expectedparrot/edsl
-
AgML
AgML is a centralized framework for agricultural machine learning. AgML provides access to public agricultural datasets for common agricultural deep learning tasks, with standard benchmarks and pretrained models, as well as the ability to generate synthetic data and annotations.
Python synthetic-data related posts
-
Hugging Face is looking for reasoning datasets beyond math, science and coding
-
1000 stars on GitHub feels like a Million likes on any other platform
-
Gemini 50% cheaper with Batch API in Curator
-
Launch HN: Trellis (YC W24) – AI-powered workflows for unstructured data
-
SDMetrics: Library for evaluating synthetic data quality
-
Synthetic data generation for tabular data
-
Ctgan: Generating synthetic data in Python using GANs
Index
What are some of the best open-source synthetic-data projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Mimesis | 4,569 |
2 | Kiln | 3,486 |
3 | BlenderProc | 3,072 |
4 | SDV | 2,833 |
5 | distilabel | 2,709 |
6 | CTGAN | 1,393 |
7 | curator | 1,340 |
8 | intellagent | 1,041 |
9 | DataDreamer | 1,015 |
10 | bonito | 774 |
11 | pygraft | 684 |
12 | gretel-synthetics | 642 |
13 | Copulas | 590 |
14 | synthcity | 552 |
15 | mostlyai | 517 |
16 | augraphy | 415 |
17 | Robotics-Object-Pose-Estimation | 315 |
18 | zpy | 306 |
19 | DoppelGANger | 306 |
20 | SDGym | 274 |
21 | edsl | 242 |
22 | SDMetrics | 235 |
23 | AgML | 216 |