Python synthetic-dataset-generation

Open-source Python projects categorized as synthetic-dataset-generation

Top 11 Python synthetic-dataset-generation Projects

  1. distilabel

    Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

    Project mention: Distilabel is a framework for synthetic data and AI feedback | news.ycombinator.com | 2025-01-28
  2. AutoPrompt

    A framework for prompt tuning using Intent-based Prompt Calibration

  3. DataDreamer

    DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤

    Project mention: FLaNK AI - 01 April 2024 | dev.to | 2024-04-01

  4. curator

    Synthetic data curation for post-training and structured data extraction (by bespokelabsai)

    Project mention: Gemini 50% cheaper with Batch API in Curator | dev.to | 2025-03-14

    Curator also offers batch processing for other APIs, including OpenAI and Anthropic.

  5. bonito

    A lightweight library for generating synthetic instruction tuning datasets for your data without GPT. (by BatsResearch)

  6. pygraft

    Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips

  7. DoppelGANger

    [IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

  8. VQASynth

    Compose multimodal datasets 🎹

  9. DeFMO

    [CVPR 2021] DeFMO: Deblurring and Shape Recovery of Fast Moving Objects

  10. dataformer

    Solving data for LLMs - Create quality synthetic datasets!

    Project mention: AIM Weekly for 07 Oct 2024 | dev.to | 2024-10-07

    🫶 Building Resilient AI Infrastructure: Deep Dive Zilliz Cloud's New Production-Ready Features 🙅 Contributing to Open Source 🛠️ Upcoming Data Engineering Best Practices for AI 📝 Building Scalable Image Retrieval 💫 NASA and IBM Weather Model 🙌 Improve RAG with Knowledge Graphs 🦾 Leader 📎 Evaluating RAG 🚙 Solid Data Curation 🤖 Sparse and Dense Embeddings 🍔 Cohere LLM University 📢 DataFormer for Synthetic Data 📢 PDF2Audio 📊 Screenpipe 📱 Vector DB Benchmarks 🛼 Extreme Quantization 📢 AI Powered Question & Answering 🐈‍⬛ Building LLMS Stanford Class 🌐 New Python Web UI 📊 Visualize RAG 🌐 Free Map Hosting 📊 Pipefunc 🖥️ The Pipe to extract 👽 New Audio Model 🧐 Easy Milvus Schema Generation 👽 Multimodal Models 72B 🌐 Fivetran + Milvus 🗣️ JSON Viewer 👽 ONNX Runtime GenAI 🚙 LLM Explorer 🦾 Interesting Computer Vision Techniques 📊 Build a model from embedding 🧩 Superchunk 👽 LLM Eval - Salesforce 🍔 Small AMD Model 🔥 Comfy UI 🔥 Molmo is a family of open vision-language models developed by the Allen Institute for AI. Molmo models are trained on PixMo

  11. discus

    A data-centric AI package for ML/AI. Get the best high-quality data for the best results. Discord: https://discord.gg/t6ADqBKrdZ

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source synthetic-dataset-generation projects in Python? This list will help you:

#   Project        Stars
1   distilabel     2,552
2   AutoPrompt     2,419
3   DataDreamer      983
4   curator          961
5   bonito           749
6   pygraft          681
7   DoppelGANger     303
8   VQASynth         306
9   DeFMO            171
10  dataformer       145
11  discus            64