Python synthetic-data

Open-source Python projects categorized as synthetic-data

Top 23 Python synthetic-data Projects

  1. Mimesis

    Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.

    Project mention: Mimesis: The Fake Data Generator That Will Blow Your Mind! | dev.to | 2025-05-08

  3. Kiln

    The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.

    Project mention: Show HN: Kiln – AI Boilerplate with Evals, Fine-Tuning, Synthetic Data, and Git | news.ycombinator.com | 2025-07-28

    I noticed there weren't boilerplates for AI projects like there were for web apps, so I built one. Same idea - everything you need to get a project up and running quickly. However, instead of web-framework/CSS/DB, it's tools for AI projects: evals, synthetic data gen, fine-tuning, and more.

    Kiln is a free, open tool that gives you everything most AI projects need in one integrated package:

    - Eval system: including LLM-as-judge evals, eval data generation, human baselines

    - Fine-tuning: proxy to many fine-tuning providers like Fireworks/Together/OpenAI/Unsloth

    - Synthetic data generation: deeply integrated into evals and fine-tuning

    - Model routing: 12 providers including Ollama, OpenRouter, and more

    - Git-based collaboration: projects are designed to be synced through your own git server

    The key insight is that these tools work much better when they're integrated. For example, the synthetic data generator knows whether you're creating data for evals vs. fine-tuning (which have very different data needs), and evals can automatically test different prompt/model/fine-tune combinations.

    It runs entirely locally - your project data stays in local files, and you control your own git repos. No external services required (though it integrates with them if you want).

    Main project GitHub: https://github.com/Kiln-AI/Kiln

    Demo GitHub where I use it to build a 'natural language to ffmpeg command' demo with evals, fine-tunes, and synthetic data (including demo video): https://github.com/Kiln-AI/demos/blob/main/end_to_end_projec...

  4. BlenderProc

    A procedural Blender pipeline for photorealistic training image generation

  5. SDV

    Synthetic data generation for tabular data

  6. distilabel

    Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

    Project mention: Distilabel is a framework for synthetic data and AI feedback | news.ycombinator.com | 2025-01-28

  7. curator

    Synthetic data curation for post-training and structured data extraction (by bespokelabsai)

    Project mention: Ask HN: Is synthetic data generation practical outside academia? | news.ycombinator.com | 2025-06-06

    https://github.com/bespokelabsai/curator

    But it still feels very research-oriented. I haven’t found many examples of these pipelines running in real-world products.

    I’m curious:

    1. Who is using synthetic-data pipelines in production today?

    2. What tasks do they actually improve (e.g. fine-tuning smaller models for specific tasks)?

    Any real-world stories, pointers, or further reading would be hugely appreciated. Thanks!

  8. CTGAN

    Conditional GAN for generating synthetic tabular data.

  10. intellagent

    A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions

    Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01

    When it comes to evaluating conversational agents, there are some smart ways to do it. Take a framework like IntellAgent. It uses AI to test other AI! It's a three-step process designed to make testing more thorough and realistic than just having a person manually try things out.

  11. DataDreamer

    DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤

    Project mention: Show HN: I built an AI dataset generator | news.ycombinator.com | 2025-06-26

    This is sometimes called distillation. Here is a robust example from some UPenn students: https://datadreamer.dev/

  12. bonito

    A lightweight library for generating synthetic instruction tuning datasets for your data without GPT. (by BatsResearch)

  13. pygraft

    Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips

  14. gretel-synthetics

    Synthetic data generators for structured and unstructured text, featuring differentially private learning.

  15. mostlyai

    Synthetic Data SDK ✨

    Project mention: Open-Source Synthetic Data SDK | news.ycombinator.com | 2025-02-22

    MOSTLY AI has open-sourced its powerful Synthetic Data SDK, enabling you to create privacy-preserving, AI-generated synthetic data directly from your existing datasets—all within your secure environments.

    Key Features:

    Broad Data Support: Handle mixed data types (categorical, numerical, geospatial, text), single/multi-table datasets & time-series data.

    Multiple Model Types: Leverage TabularARGN (SOTA for tabular data), fine-tuned HuggingFace models, and efficient LSTM for text generation.

    Advanced Training Options: CPU/GPU support, differential privacy, and real-time progress monitoring.

    Automated Quality Assurance: Built-in fidelity & privacy metrics with detailed HTML reports for visual data analysis.

    Flexible Sampling: Upsample data, generate conditionally, rebalance segments, impute context-aware values, ensure fairness, and control outputs via temperature adjustments.

    Seamless Integration: Connect effortlessly to external databases & cloud storage with a fully permissive open-source license.

    Check out the SDK on GitHub: https://github.com/mostly-ai/mostlyai
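The "temperature adjustments" mentioned under Flexible Sampling can be pictured with a generic sketch. This is not the MOSTLY AI SDK's API or its actual sampling code; the function names and the example probabilities below are hypothetical, and the sketch only assumes that temperature rescales a categorical distribution before a value is drawn.

```python
import random


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution: raise each p to 1/T, then renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_category(categories, probs, temperature, rng):
    """Draw one category from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(categories, weights=adjusted, k=1)[0]


# Hypothetical learned distribution for one categorical column.
categories = ["basic", "plus", "premium"]
base_probs = [0.7, 0.2, 0.1]

sharp = apply_temperature(base_probs, 0.5)  # T < 1 concentrates mass on the most likely value
flat = apply_temperature(base_probs, 2.0)   # T > 1 spreads mass toward rarer values

rng = random.Random(0)
draws = [sample_category(categories, base_probs, 2.0, rng) for _ in range(5)]
```

Under this assumption, a temperature of 1 leaves the distribution unchanged, lower values make the output more conservative, and higher values make rare categories appear more often.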

  16. Copulas

    A library to model multivariate data using copulas.
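As a rough illustration of the idea (not the Copulas library's API), a Gaussian copula keeps the dependence between columns separate from each column's own distribution. The sketch below uses only the standard library; the function names, the correlation of 0.8, and the two marginals are assumptions chosen for the example.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(n, rho, inv_cdf_x, inv_cdf_y, seed=0):
    """Draw n pairs whose dependence comes from a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        # Step 1: two correlated standard normals (2x2 Cholesky factor written out by hand).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Step 2: the normal CDF turns each into a Uniform(0, 1) value and keeps the dependence.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Clamp away from 0 and 1 so the inverse CDFs below never see an endpoint.
        u1 = min(max(u1, 1e-12), 1.0 - 1e-12)
        u2 = min(max(u2, 1e-12), 1.0 - 1e-12)
        # Step 3: inverse CDFs impose whatever marginal distribution each column should have.
        pairs.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return pairs


def exponential_inv_cdf(rate):
    """Inverse CDF of an exponential distribution with the given rate."""
    return lambda u: -math.log(1.0 - u) / rate


def uniform_inv_cdf(low, high):
    """Inverse CDF of a uniform distribution on [low, high]."""
    return lambda u: low + (high - low) * u


# Hypothetical columns: an exponential "wait time" and a uniform "age", positively correlated.
samples = sample_gaussian_copula(
    2000, 0.8, exponential_inv_cdf(0.5), uniform_inv_cdf(18.0, 90.0)
)
```

Swapping the inverse CDFs changes the marginals without touching the dependence structure, which is the property that makes copulas attractive for synthetic tabular data.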

  17. synthcity

    A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.

  18. augraphy

    Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes

    Project mention: All Data and AI Weekly #182 - 24-March-2025 | dev.to | 2025-03-24

    ⚡️ https://github.com/sparkfish/augraphy

  19. Robotics-Object-Pose-Estimation

    A complete end-to-end demonstration in which we collect training data in Unity and use that data to train a deep neural network to predict the pose of a cube. This model is then deployed in a simulated robotic pick-and-place task.

  20. zpy

    Synthetic data for computer vision. An open source toolkit using Blender and Python.

  21. DoppelGANger

    [IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

  22. SDGym

    Benchmarking synthetic data generation methods.

  23. edsl

    Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.

  24. SDMetrics

    Metrics to evaluate quality and efficacy of synthetic datasets.
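To make "quality of a synthetic dataset" concrete, here is a minimal sketch of one common kind of per-column check: one minus the two-sample Kolmogorov-Smirnov statistic between a real column and a synthetic one. It is an illustration in plain Python, not SDMetrics' implementation; the function names and sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The gap between two step functions can only peak at an observed value.
    for value in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, value) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def similarity_score(real, synthetic):
    """Score in [0, 1]: 1.0 means identical empirical distributions, 0.0 means no overlap."""
    return 1.0 - ks_statistic(real, synthetic)


# Hypothetical numeric column and two candidate synthetic versions of it.
real_column = [1, 2, 3, 4]
shifted_column = [3, 4, 5, 6]
disjoint_column = [10, 11, 12, 13]

score_same = similarity_score(real_column, real_column)
score_shifted = similarity_score(real_column, shifted_column)
score_disjoint = similarity_score(real_column, disjoint_column)
```

A score like this only compares one column at a time; it says nothing about relationships between columns or about privacy, which need separate checks.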

  25. AgML

    AgML is a centralized framework for agricultural machine learning. AgML provides access to public agricultural datasets for common agricultural deep learning tasks, with standard benchmarks and pretrained models, as well as the ability to generate synthetic data and annotations.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python synthetic-data related posts

  • Show HN: I built an AI dataset generator

    5 projects | news.ycombinator.com | 26 Jun 2025
  • Ask HN: Is synthetic data generation practical outside academia?

    2 projects | news.ycombinator.com | 6 Jun 2025
  • Hugging Face is looking for reasoning datasets beyond math, science and coding

    2 projects | dev.to | 16 Apr 2025
  • 1000 stars on GitHub feels like a Million likes on any other platform

    1 project | dev.to | 21 Mar 2025
  • Gemini 50% cheaper with Batch API in Curator

    1 project | dev.to | 14 Mar 2025
  • Launch HN: Trellis (YC W24) – AI-powered workflows for unstructured data

    3 projects | news.ycombinator.com | 13 Aug 2024
  • SDMetrics: Library for evaluating synthetic data quality

    1 project | news.ycombinator.com | 12 Apr 2024

Index

What are some of the best open-source synthetic-data projects in Python? This list will help you:

#   Project                          Stars
1   Mimesis                          4,612
2   Kiln                             4,084
3   BlenderProc                      3,191
4   SDV                              3,139
5   distilabel                       2,862
6   curator                          1,487
7   CTGAN                            1,443
8   intellagent                      1,117
9   DataDreamer                      1,051
10  bonito                             788
11  pygraft                            690
12  gretel-synthetics                  656
13  mostlyai                           622
14  Copulas                            607
15  synthcity                          582
16  augraphy                           451
17  Robotics-Object-Pose-Estimation    325
18  zpy                                310
19  DoppelGANger                       306
20  SDGym                              277
21  edsl                               270
22  SDMetrics                          246
23  AgML                               233
