Python data-augmentation

Open-source Python projects categorized as data-augmentation

Top 20 Python data-augmentation Projects

  • snorkel

    A system for quickly generating training data with weak supervision

    Project mention: [P] We are building a curated list of open source tooling for data-centric AI workflows, looking for contributions. | 2023-03-03

    The paid product came out of an open source tool:

  • TextAttack

    TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

    Project mention: TextAttack VS OpenAttack - a user suggested alternative | 2022-07-06

  • torchio

    Medical imaging toolkit for deep learning

  • eda_nlp

    Data augmentation for NLP, presented at EMNLP 2019

  • webdataset

    A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.

    Project mention: [D] Title: Best tools and frameworks for working with million-billion image datasets? | 2023-03-26
  • fastdup

    fastdup is a free tool for rapidly extracting insights from image and video datasets. It helps you improve the quality of your images and labels and reduce data operations costs at scale.

    Project mention: Visualize your dataset using DINOv2 embedding | 2023-05-02

    Visualizing a dataset (especially a large one) in a low-dimensional embedding space can reveal a lot about its patterns and clusters.

    We recently released a notebook showing how to visualize your dataset with DINOv2 models running on a CPU.

    Yes! No GPUs needed.

    We used it to find clusters of similar images, duplicates, and outliers in a subset of the LAION dataset.

    Try it on your own dataset:

    Colab notebook -

    GitHub repo -

  • inltk

    Natural Language Toolkit for Indic Languages aims to provide out of the box support for various NLP tasks that an application developer might need

  • image_augmentor

    Data augmentation tool for images

  • ModelNet40-C

    Repo for "Benchmarking Robustness of 3D Point Cloud Recognition against Common Corruptions"

  • ContraD

    Code for the paper "Training GANs with Stronger Augmentations via Contrastive Discriminator" (ICLR 2021)

  • GAug

    AAAI'21: Data Augmentation for Graph Neural Networks

  • synthcity

    A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.

    Project mention: Benchmark synthetic tabular data generators using Synthcity | 2023-01-23


  • mutate

    A library to synthesize text datasets using Large Language Models (LLM)

  • genius

    💡GENIUS – generating text using sketches! A strong and general textual data augmentation tool.

    Project mention: Best language model for filling multiple related masks [D] | 2023-01-09


  • question_extractor

    Generate question/answer training pairs out of raw text.

    Project mention: Show HN: Question Extractor: turn text into LLM finetuning data | 2023-04-19
  • KitanaQA

    KitanaQA: Adversarial training and data augmentation for neural question-answering models (by searchableai)

  • targetran

    Python library for data augmentation in object detection or image classification model training

  • fastaugment

    A handy data augmentation toolkit for image classification put in a single efficient TensorFlow op.

  • MTR

    (NeurIPS 2023) The official implementation of the paper "Rethinking Data Augmentation for Tabular Data in Deep Learning" (by somaonishi)

    Project mention: Rethinking Data Augmentation for Tabular Data in Deep Learning | 2023-05-18

    Tabular data is the most widely used data format in machine learning (ML). While tree-based methods outperform DL-based methods in supervised learning, recent literature reports that self-supervised learning with Transformer-based models outperforms tree-based methods. In the existing literature on self-supervised learning for tabular data, contrastive learning is the predominant method. In contrastive learning, data augmentation is important to generate different views. However, data augmentation for tabular data has been difficult due to the unique structure and high complexity of tabular data. In addition, three main components are proposed together in existing methods: model structure, self-supervised learning methods, and data augmentation. Therefore, previous works have compared the performance without comprehensively considering these components, and it is not clear how each component affects the actual performance. In this study, we focus on data augmentation to address these issues. We propose a novel data augmentation method, Mask Token Replacement (MTR), which replaces the mask token with a portion of each tokenized column; MTR takes advantage of the properties of Transformer, which is becoming the predominant DL-based architecture for tabular data, to perform data augmentation for each column embedding. Through experiments with 13 diverse public datasets in both supervised and self-supervised learning scenarios, we show that MTR achieves competitive performance against existing data augmentation methods and improves model performance. In addition, we discuss specific scenarios in which MTR is most effective and identify the scope of its application. The code is available at
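The idea in the abstract above can be illustrated with a short sketch. It is one reading of the abstract, not the authors' implementation: it assumes each table row has already been turned into one embedding vector per column, and that augmentation swaps a random subset of those column embeddings for a shared mask-token vector. The function name, the replacement probability `p`, and the use of plain lists are all hypothetical choices for this example.

```python
import random


def mask_token_replacement(column_embeddings, mask_token, p=0.3, rng=None):
    """Return a copy of one row's column embeddings with some columns masked.

    column_embeddings: list of per-column embedding vectors (lists of floats).
    mask_token:        one vector of the same dimension; in a real model this
                       would be a learned parameter (assumption).
    p:                 probability of replacing each column (hypothetical).
    rng:               optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    augmented = []
    for emb in column_embeddings:
        if len(emb) != len(mask_token):
            raise ValueError("mask_token must match the embedding dimension")
        # With probability p, the whole column embedding becomes the mask token;
        # otherwise the original embedding is copied unchanged.
        augmented.append(list(mask_token) if rng.random() < p else list(emb))
    return augmented
```

Calling the function twice on the same row with different random states yields two differently masked views, which is the kind of input pair contrastive learning needs.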

  • shutter

    Stochastic image generator for annotated synthetic datasets (by Rainelz)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-05-18.

What are some of the best open-source data-augmentation projects in Python? This list will help you:

Rank  Project             Stars
1     snorkel             5,500
2     TextAttack          2,359
3     torchio             1,732
4     eda_nlp             1,439
5     webdataset          1,365
6     fastdup             1,025
7     inltk               789
8     image_augmentor     407
9     ModelNet40-C        181
10    ContraD             180
11    GAug                159
12    synthcity           157
13    mutate              146
14    genius              143
15    question_extractor  111
16    KitanaQA            57
17    targetran           17
18    fastaugment         10
19    MTR                 1
20    shutter             0