Python synthetic-data

Open-source Python projects categorized as synthetic-data

Top 23 Python synthetic-data Projects

  1. Mimesis

    Mimesis is a robust data generator for Python that can produce a wide range of fake data in multiple languages.

    Project mention: Mimesis: The Fake Data Generator That Will Blow Your Mind! | dev.to | 2025-05-08

  3. Kiln

    The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.

    Project mention: Show HN: Create your own finetuned AI model using Google Sheets | news.ycombinator.com | 2025-04-30

    What’s the thinking of spreadsheet first? Just making it super accessible for people who already have data?

    I’m building a UI for fine tuning (and evals, and synthetic data gen) - https://github.com/Kiln-AI/Kiln - and went the custom UI route. From chatting with folks - most people don’t have datasets, and need help building them.

  4. BlenderProc

    A procedural Blender pipeline for photorealistic training image generation

  5. SDV

    Synthetic data generation for tabular data

  6. distilabel

    Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

    Project mention: Distilabel is a framework for synthetic data and AI feedback | news.ycombinator.com | 2025-01-28
  7. CTGAN

    Conditional GAN for generating synthetic tabular data.

  8. curator

    Synthetic data curation for post-training and structured data extraction (by bespokelabsai)

    Project mention: Hugging Face is looking for reasoning datasets beyond math, science and coding | dev.to | 2025-04-16

    Top 4 innovative uses of Curator, each get a $250 Amazon (or country-specific equivalent) gift card

  10. intellagent

    A framework for comprehensive diagnosis and optimization of agents using simulated, realistic synthetic interactions

    Project mention: Making Sure AI Agents Play Nice: A Look at How We Evaluate Them | dev.to | 2025-05-01

    When it comes to evaluating conversational agents, there are some smart ways to do it. Take a framework like IntellAgent. It uses AI to test other AI! It's a three-step process designed to make testing more thorough and realistic than just having a person manually try things out.

  11. DataDreamer

    DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models.   🤖💤

  12. bonito

    A lightweight library for generating synthetic instruction tuning datasets for your data without GPT. (by BatsResearch)

  13. pygraft

    Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips

  14. gretel-synthetics

    Synthetic data generators for structured and unstructured text, featuring differentially private learning.

  15. Copulas

    A library to model multivariate data using copulas.

  16. synthcity

    A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.

  17. mostlyai

    Synthetic Data SDK ✨

    Project mention: Open-Source Synthetic Data SDK | news.ycombinator.com | 2025-02-22

    MOSTLY AI has open-sourced its powerful Synthetic Data SDK, enabling you to create privacy-preserving, AI-generated synthetic data directly from your existing datasets—all within your secure environments.

    Key Features:

    Broad Data Support: Handle mixed data types (categorical, numerical, geospatial, text), single/multi-table datasets & time-series data.

    Multiple Model Types: Leverage TabularARGN (SOTA for tabular data), fine-tuned HuggingFace models, and efficient LSTM for text generation.

    Advanced Training Options: CPU/GPU support, differential privacy, and real-time progress monitoring.

    Automated Quality Assurance: Built-in fidelity & privacy metrics with detailed HTML reports for visual data analysis.

    Flexible Sampling: Upsample data, generate conditionally, rebalance segments, impute context-aware values, ensure fairness, and control outputs via temperature adjustments.

    Seamless Integration: Connect effortlessly to external databases & cloud storage with a fully permissive open-source license.

    Check out the SDK on GitHub: https://github.com/mostly-ai/mostlyai

  18. augraphy

    Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes

    Project mention: All Data and AI Weekly #182 - 24-March-2025 | dev.to | 2025-03-24

    ⚡️ https://github.com/sparkfish/augraphy

  19. Robotics-Object-Pose-Estimation

    A complete end-to-end demonstration in which we collect training data in Unity and use that data to train a deep neural network to predict the pose of a cube. This model is then deployed in a simulated robotic pick-and-place task.

  20. zpy

    Synthetic data for computer vision. An open source toolkit using Blender and Python.

  21. DoppelGANger

    [IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

  22. SDGym

    Benchmarking synthetic data generation methods.

  23. edsl

    Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.

    Project mention: Python Library for Structured Data Extraction via LLM | news.ycombinator.com | 2024-08-14

    Hey thanks for noticing - here's the MIT licensed library it's based on: https://github.com/expectedparrot/edsl

  24. SDMetrics

    Metrics to evaluate quality and efficacy of synthetic datasets.

  25. AgML

    AgML is a centralized framework for agricultural machine learning. AgML provides access to public agricultural datasets for common agricultural deep learning tasks, with standard benchmarks and pretrained models, as well as the ability to generate synthetic data and annotations.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python synthetic-data related posts

  • Hugging Face is looking for reasoning datasets beyond math, science and coding

    2 projects | dev.to | 16 Apr 2025
  • 1000 stars on GitHub feels like a Million likes on any other platform

    1 project | dev.to | 21 Mar 2025
  • Gemini 50% cheaper with Batch API in Curator

    1 project | dev.to | 14 Mar 2025
  • Launch HN: Trellis (YC W24) – AI-powered workflows for unstructured data

    3 projects | news.ycombinator.com | 13 Aug 2024
  • SDMetrics: Library for evaluating synthetic data quality

    1 project | news.ycombinator.com | 12 Apr 2024
  • Synthetic data generation for tabular data

    2 projects | news.ycombinator.com | 27 Feb 2024
  • Ctgan: Generating synthetic data in Python using GANs

    1 project | news.ycombinator.com | 5 Feb 2024

Index

What are some of the best open-source synthetic-data projects in Python? This list will help you:

#   Project                          Stars
1   Mimesis                          4,569
2   Kiln                             3,486
3   BlenderProc                      3,072
4   SDV                              2,833
5   distilabel                       2,709
6   CTGAN                            1,393
7   curator                          1,340
8   intellagent                      1,041
9   DataDreamer                      1,015
10  bonito                             774
11  pygraft                            684
12  gretel-synthetics                  642
13  Copulas                            590
14  synthcity                          552
15  mostlyai                           517
16  augraphy                           415
17  Robotics-Object-Pose-Estimation    315
18  zpy                                306
19  DoppelGANger                       306
20  SDGym                              274
21  edsl                               242
22  SDMetrics                          235
23  AgML                               216
