Top 22 Python data-augmentation Projects

snorkel

5 5,701 5.5 Python

A system for quickly generating training data with weak supervision
TextAttack

3 2,744 8.4 Python

TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP https://textattack.readthedocs.io/en/master/

Project mention: Preprocessing methods besides stop words, regular expressions, lemmatization and stemming for an NLP classification problem | /r/MLQuestions | 2023-06-09

Could have a look at what's available in the augmentor here https://github.com/QData/TextAttack. I'm not experienced with NLP so I may be way off here

torchio

2 1,949 8.2 Python

Medical imaging toolkit for deep learning
webdataset

7 1,924 8.9 Python

A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.

Project mention: How to use data stored in a (private) S3 Bucket for training? | /r/pytorch | 2023-07-21

As an alternative, I've looked into using WebDataset, but couldn't figure out how to access data that is stored in a private bucket.
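One approach that may help with the private-bucket question above: WebDataset accepts `pipe:` URLs, which read each shard from the standard output of a shell command. The sketch below only builds such a URL around `aws s3 cp`. It assumes the AWS CLI is installed and already configured with credentials that can read the bucket. The helper name, bucket name, and shard pattern are hypothetical.

```python
def s3_pipe_url(bucket: str, key_pattern: str) -> str:
    """Build a `pipe:` URL that streams shards from S3 through the AWS CLI."""
    # The trailing "-" tells `aws s3 cp` to write the object to stdout,
    # which is where a `pipe:` URL expects the shard bytes to appear.
    return f"pipe:aws s3 cp s3://{bucket}/{key_pattern} -"


# Hypothetical bucket and shard naming; the brace range is left for
# WebDataset to expand into individual shard URLs.
url = s3_pipe_url("my-private-bucket", "train/shard-{000000..000009}.tar")
print(url)

# With webdataset installed, the URL would be handed to the loader as-is:
#   import webdataset as wds
#   dataset = wds.WebDataset(url)
```

Because the download is delegated to the CLI, authentication is handled by whatever credentials the CLI already uses, so nothing secret needs to appear in the training script.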

eda_nlp

1 1,536 0.0 Python

Data augmentation for NLP, presented at EMNLP 2019
fastdup

18 1,398 9.4 Python

fastdup is a free tool for rapidly extracting insights from image and video datasets. It helps you improve the quality of your dataset's images and labels and reduce data operations costs at scale.

Project mention: Visualize your dataset using DINOv2 embedding | news.ycombinator.com | 2023-05-02

Visualizing your dataset (especially large ones) in a low-dimensional embedding space can tell you a lot about the patterns and clusters in your dataset.
We recently released a notebook showing how you can visualize your dataset using DINOv2 models by running it on your CPU.
Yes! No GPUs needed.
We used it to find clusters of similar images, duplicates, and outliers in a subset of the LAION dataset.
Try it on your own dataset:
Colab notebook - https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/dinov2_notebook.ipynb
GitHub repo - https://github.com/visual-layer/fastdup

inltk

1 811 0.0 Python

Natural Language Toolkit for Indic Languages aims to provide out of the box support for various NLP tasks that an application developer might need
image_augmentor

1 434 0.0 Python

Data augmentation tool for images
synthcity

4 351 7.3 Python

A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.
ModelNet40-C

2 200 0.0 Python

Repo for "Benchmarking Robustness of 3D Point Cloud Recognition against Common Corruptions" https://arxiv.org/abs/2201.12296
ContraD

1 186 0.0 Python

Code for the paper "Training GANs with Stronger Augmentations via Contrastive Discriminator" (ICLR 2021)
question_extractor

1 181 7.3 Python

Generate question/answer training pairs out of raw text.

Project mention: Show HN: Question Extractor: turn text into LLM finetuning data | news.ycombinator.com | 2023-04-19

GAug

1 181 0.0 Python

AAAI'21: Data Augmentation for Graph Neural Networks
genius

2 175 10.0 Python

💡GENIUS – generating text using sketches! A strong text generation & data augmentation tool.
mutate

1 149 0.0 Python

A library to synthesize text datasets using Large Language Models (LLM)
KitanaQA

1 57 0.0 Python

KitanaQA: Adversarial training and data augmentation for neural question-answering models (by searchableai)
vkit

1 21 1.9 Python

Boosting Document Intelligence
targetran

2 19 4.5 Python

Python library for data augmentation in object detection or image classification model training
fastaugment

2 14 5.3 Python

A handy data augmentation toolkit for image classification put in a single efficient TensorFlow/PyTorch op.
degradr

1 11 8.0 Python

Python library for realistically degrading images.

Project mention: How to generate realistic PSFs for camera lenses? | /r/Optics | 2023-09-07

(I do plan on making my results open source here, but it's obviously still a work in progress)

MTR

1 9 4.9 Python

The official implementation of the paper "Rethinking Data Augmentation for Tabular Data in Deep Learning" (by somaonishi)

Project mention: Rethinking Data Augmentation for Tabular Data in Deep Learning | /r/BotNewsPreprints | 2023-05-18

Tabular data is the most widely used data format in machine learning (ML). While tree-based methods outperform DL-based methods in supervised learning, recent literature reports that self-supervised learning with Transformer-based models outperforms tree-based methods. In the existing literature on self-supervised learning for tabular data, contrastive learning is the predominant method. In contrastive learning, data augmentation is important to generate different views. However, data augmentation for tabular data has been difficult due to the unique structure and high complexity of tabular data. In addition, three main components are proposed together in existing methods: model structure, self-supervised learning methods, and data augmentation. Therefore, previous works have compared the performance without comprehensively considering these components, and it is not clear how each component affects the actual performance. In this study, we focus on data augmentation to address these issues. We propose a novel data augmentation method, Mask Token Replacement (MTR), which replaces the mask token with a portion of each tokenized column; MTR takes advantage of the properties of Transformer, which is becoming the predominant DL-based architecture for tabular data, to perform data augmentation for each column embedding. Through experiments with 13 diverse public datasets in both supervised and self-supervised learning scenarios, we show that MTR achieves competitive performance against existing data augmentation methods and improves model performance. In addition, we discuss specific scenarios in which MTR is most effective and identify the scope of its application. The code is available at https://github.com/somaonishi/MTR/.
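The augmentation described in the abstract can be pictured with a small sketch. This is one reading of the abstract, not the authors' implementation: it assumes each table row has already been turned into one embedding vector per column, and that augmentation means swapping a random subset of those column embeddings for a shared mask-token vector. The function name, the probability parameter `p`, and the toy vectors are all hypothetical.

```python
import random


def mask_token_replacement(column_embeddings, mask_token, p=0.2, rng=None):
    """Return a copy of one row's column embeddings in which each column
    embedding is independently replaced by the mask token with probability p.

    Assumption: this per-column Bernoulli choice is an illustrative stand-in
    for the replacement rule in the paper, which the abstract does not spell out.
    """
    if rng is None:
        rng = random.Random()
    augmented = []
    for embedding in column_embeddings:
        if rng.random() < p:
            # This column is hidden behind the shared mask token.
            augmented.append(list(mask_token))
        else:
            # This column is kept unchanged (copied so the input is not aliased).
            augmented.append(list(embedding))
    return augmented


# Toy row with three columns, each embedded in two dimensions.
row = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
mask = [0.0, 0.0]
augmented_row = mask_token_replacement(row, mask, p=0.5, rng=random.Random(0))
print(augmented_row)
```

Calling the function twice on the same row with different random states yields two different views of that row, which is the kind of input contrastive learning needs.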

shutter

1 1 0.0 Python

Stochastic image generator for annotated synthetic datasets (by Rainelz)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-09-07.

Python data-augmentation related posts

How to use data stored in a (private) S3 Bucket for training?
1 project | /r/pytorch | 21 Jul 2023
[D] Title: Best tools and frameworks for working with million-billion image datasets?
1 project | /r/MachineLearning | 26 Mar 2023
[D] Training networks on extremely large datasets (10+TB)?
1 project | /r/MachineLearning | 17 Feb 2023
Best language model for filling multiple related masks [D]
1 project | /r/MachineLearning | 9 Jan 2023
Hi everyone, my first Reddit post, let me introduce the GENIUS model.
2 projects | /r/deeplearning | 23 Nov 2022
[D] Efficiently loading videos in PyTorch without extracting frames
5 projects | /r/MachineLearning | 26 Oct 2021
New image augmentation library for TF Dataset + TPU
1 project | /r/tensorflow | 14 Sep 2021

Index

What are some of the best open-source data-augmentation projects in Python? This list will help you:

#	Project	Stars
1	snorkel	5,701
2	TextAttack	2,744
3	torchio	1,949
4	webdataset	1,924
5	eda_nlp	1,536
6	fastdup	1,398
7	inltk	811
8	image_augmentor	434
9	synthcity	351
10	ModelNet40-C	200
11	ContraD	186
12	question_extractor	181
13	GAug	181
14	genius	175
15	mutate	149
16	KitanaQA	57
17	vkit	21
18	targetran	19
19	fastaugment	14
20	degradr	11
21	MTR	9
22	shutter	1