Top 11 Python synthetic-dataset-generation Projects
-
distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
Project mention: Distilabel is a framework for synthetic data and AI feedback | news.ycombinator.com | 2025-01-28
-
bonito
A lightweight library for generating synthetic instruction tuning datasets for your data without GPT. (by BatsResearch)
-
DoppelGANger
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
-
discus
A data-centric AI package for ML/AI. Get the best high-quality data for the best results. Discord: https://discord.gg/t6ADqBKrdZ
Index
What are some of the best open-source synthetic-dataset-generation projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | distilabel | 2,552 |
2 | AutoPrompt | 2,419 |
3 | DataDreamer | 983 |
4 | curator | 961 |
5 | bonito | 749 |
6 | pygraft | 681 |
7 | DoppelGANger | 303 |
8 | VQASynth | 306 |
9 | DeFMO | 171 |
10 | dataformer | 145 |
11 | discus | 64 |