Top 23 Python synthetic-data Projects
-
Mimesis
Mimesis is a powerful Python library that empowers developers to generate massive amounts of synthetic data efficiently.
-
gretel-synthetics
Synthetic data generators for structured and unstructured text, featuring differentially private learning.
-
bonito
A lightweight library for generating synthetic instruction tuning datasets for your data without GPT. (by BatsResearch)
-
synthcity
A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.
-
DoppelGANger
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
-
Robotics-Object-Pose-Estimation
A complete end-to-end demonstration in which we collect training data in Unity and use that data to train a deep neural network to predict the pose of a cube. This model is then deployed in a simulated robotic pick-and-place task.
-
AgML
AgML is a centralized framework for agricultural machine learning. AgML provides access to public agricultural datasets for common agricultural deep learning tasks, with standard benchmarks and pretrained models, as well as the ability to generate synthetic data and annotations.
-
FAST-RIR
This is the official implementation of our neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment.
-
discus
A data-centric AI package for ML/AI. Get the best high-quality data for the best results. Discord: https://discord.gg/t6ADqBKrdZ
-
Main
Main folder. Material related to my books on synthetic data and generative AI. Also contains documents blending components from several folders, or covering topics spanning multiple folders. (by VincentGranville)
Can someone help me understand the licensing of this?
https://github.com/sdv-dev/SDV/blob/main/LICENSE
It was MIT licensed up until 2022, when it was changed to what it is now. They say that it will become MIT again 4 years after release... but is that counted from when the license was changed or from the first release of the software on GitHub?
Project mention: Ctgan: Generating synthetic data in Python using GANs | news.ycombinator.com | 2024-02-05
Project mention: PyGraft: Configurable Generation of Schemas and Knowledge Graphs | news.ycombinator.com | 2023-09-13
Project mention: Ask HN: If we train an LLM with “data” instead of “language” tokens | news.ycombinator.com | 2023-08-16
Hey there! Co-founder of Gretel.ai here, and I think I can provide some insights on this topic.
Firstly, the concept you're hinting at is not purely traditional ML. In traditional machine learning, we often prioritize feature extraction and engineering specific to a given problem space before training.
What you're describing, and what we've been working on at Gretel.ai, is using models such as Large Language Models (LLMs) to understand and extrapolate from vast amounts of diverse data without the need for time-consuming feature engineering. Here's a link to our open-source library for synthetic data generation (currently supporting GAN and RNN-based language models): https://github.com/gretelai/gretel-synthetics. See also our recent announcement of a Tabular LLM we're training to help people build with data: https://gretel.ai/tabular-llm
A few areas where we've found tabular or Large Data Models to be really useful are:
Project mention: SDMetrics: Library for evaluating synthetic data quality | news.ycombinator.com | 2024-04-12
Project mention: Access to public agricultural datasets for agricultural deep learning tasks | news.ycombinator.com | 2023-11-05
Project mention: an open source package helping developers generate data for LLMs | /r/mlops | 2023-08-02
Python synthetic-data related posts
- SDMetrics: Library for evaluating synthetic data quality
- Synthetic data generation for tabular data
- Ctgan: Generating synthetic data in Python using GANs
- Phibrarian Alpha - the first model checkpoint from SciPhi's Mistral-7b
- With LLMs we can create a fully open-source Library of Alexandria.
- Textbook was authored with an AI pipeline
- Ask HN: If we train an LLM with “data” instead of “language” tokens
Index
What are some of the best open-source synthetic-data projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Mimesis | 4,300 |
2 | BlenderProc | 2,536 |
3 | SDV | 2,105 |
4 | CTGAN | 1,130 |
5 | pygraft | 639 |
6 | DataDreamer | 630 |
7 | gretel-synthetics | 530 |
8 | Copulas | 501 |
9 | bonito | 463 |
10 | synthcity | 351 |
11 | zpy | 288 |
12 | DoppelGANger | 275 |
13 | Robotics-Object-Pose-Estimation | 263 |
14 | SDGym | 241 |
15 | SDMetrics | 189 |
16 | AgML | 150 |
17 | FAST-RIR | 135 |
18 | DeepEcho | 87 |
19 | discus | 62 |
20 | anonymeter | 58 |
21 | Main | 57 |
22 | tofu | 51 |
23 | gretel-python-client | 43 |