Python data-augmentation

Open-source Python projects categorized as data-augmentation

Top 22 Python data-augmentation Projects

  • snorkel

    A system for quickly generating training data with weak supervision

  • TextAttack

    TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP https://textattack.readthedocs.io/en/master/

  • Project mention: Preprocessing methods besides stop words, regular expressions, lemmatization and stemming for an NLP classification problem | /r/MLQuestions | 2023-06-09

    Could have a look at what's available in the augmentor here https://github.com/QData/TextAttack. I'm not experienced with NLP so I may be way off here

  • torchio

    Medical imaging toolkit for deep learning

  • webdataset

    A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.

  • Project mention: How to use data stored in a (private) S3 Bucket for training? | /r/pytorch | 2023-07-21

    As an alternative, I've looked into using WebDataset, but couldn't figure out how to access data that is stored in a private bucket.
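    One commonly used approach to the private-bucket question above is WebDataset's `pipe:` URL scheme, which reads each shard from the standard output of a shell command. The sketch below only builds such a URL. The bucket name and shard pattern are hypothetical, and it assumes the AWS CLI is installed and already configured with credentials that can read the bucket.

```python
def s3_pipe_url(bucket: str, key_pattern: str) -> str:
    """Build a 'pipe:' URL that streams shards to stdout via the AWS CLI."""
    # The trailing '-' tells 'aws s3 cp' to write the object to stdout.
    return f"pipe:aws s3 cp s3://{bucket}/{key_pattern} -"


def make_dataset(url: str):
    """Open the shards with WebDataset (imported lazily; requires webdataset)."""
    import webdataset as wds
    return wds.WebDataset(url)


# Hypothetical bucket and shard names, for illustration only.
url = s3_pipe_url("my-private-bucket", "train/shard-{000000..000009}.tar")
print(url)
```

    Passing the resulting URL to `make_dataset` would then stream the shards through the CLI, so access control is handled by the AWS credentials rather than by WebDataset itself.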

  • eda_nlp

    Data augmentation for NLP, presented at EMNLP 2019

  • fastdup

    fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset images and labels and reduce your data operations costs at an unparalleled scale.

  • Project mention: Visualize your dataset using DINOv2 embedding | news.ycombinator.com | 2023-05-02

    Visualizing your dataset (especially large ones) in a low-dimensional embedding space can tell you a lot about the patterns and clusters in your dataset.

    We recently released a notebook showing how you can visualize your dataset using DINOv2 models by running it on your CPU.

    Yes! No GPUs needed.

    We used it to find clusters of similar images, duplicates, and outliers in a subset of the LAION dataset

    Try it on your own dataset:

    Colab notebook - https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/dinov2_notebook.ipynb

    GitHub repo - https://github.com/visual-layer/fastdup

  • inltk

    Natural Language Toolkit for Indic Languages aims to provide out of the box support for various NLP tasks that an application developer might need

  • image_augmentor

    Data augmentation tool for images

  • synthcity

    A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.

  • ModelNet40-C

    Repo for "Benchmarking Robustness of 3D Point Cloud Recognition against Common Corruptions" https://arxiv.org/abs/2201.12296

  • ContraD

    Code for the paper "Training GANs with Stronger Augmentations via Contrastive Discriminator" (ICLR 2021)

  • question_extractor

    Generate question/answer training pairs out of raw text.

  • Project mention: Show HN: Question Extractor: turn text into LLM finetuning data | news.ycombinator.com | 2023-04-19
  • GAug

    AAAI'21: Data Augmentation for Graph Neural Networks

  • genius

    💡GENIUS – generating text using sketches! A strong text generation & data augmentation tool.

  • mutate

    A library to synthesize text datasets using Large Language Models (LLM)

  • KitanaQA

    KitanaQA: Adversarial training and data augmentation for neural question-answering models (by searchableai)

  • vkit

    Boosting Document Intelligence

  • targetran

    Python library for data augmentation in object detection or image classification model training

  • fastaugment

    A handy data augmentation toolkit for image classification put in a single efficient TensorFlow/PyTorch op.

  • degradr

    Python library for realistically degrading images.

  • Project mention: How to generate realistic PSFs for camera lenses? | /r/Optics | 2023-09-07

    (I do plan on making my results open source here, but it's obviously still a work in progress)

  • MTR

    The official implementation of the paper "Rethinking Data Augmentation for Tabular Data in Deep Learning" (by somaonishi)

  • Project mention: Rethinking Data Augmentation for Tabular Data in Deep Learning | /r/BotNewsPreprints | 2023-05-18

    Tabular data is the most widely used data format in machine learning (ML). While tree-based methods outperform DL-based methods in supervised learning, recent literature reports that self-supervised learning with Transformer-based models outperforms tree-based methods. In the existing literature on self-supervised learning for tabular data, contrastive learning is the predominant method. In contrastive learning, data augmentation is important to generate different views. However, data augmentation for tabular data has been difficult due to the unique structure and high complexity of tabular data. In addition, three main components are proposed together in existing methods: model structure, self-supervised learning methods, and data augmentation. Therefore, previous works have compared the performance without comprehensively considering these components, and it is not clear how each component affects the actual performance. In this study, we focus on data augmentation to address these issues. We propose a novel data augmentation method, Mask Token Replacement (MTR), which replaces the mask token with a portion of each tokenized column; MTR takes advantage of the properties of Transformer, which is becoming the predominant DL-based architecture for tabular data, to perform data augmentation for each column embedding. Through experiments with 13 diverse public datasets in both supervised and self-supervised learning scenarios, we show that MTR achieves competitive performance against existing data augmentation methods and improves model performance. In addition, we discuss specific scenarios in which MTR is most effective and identify the scope of its application. The code is available at https://github.com/somaonishi/MTR/.
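    The idea described in the abstract can be sketched roughly as follows. This is one reading of the abstract, not the authors' implementation: each column embedding of a row is swapped for a shared mask-token vector with some probability. The function name, the `replace_prob` parameter, and the use of plain Python lists for embeddings are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]


def mask_token_replacement(
    column_embeddings: Sequence[Vector],
    mask_token: Vector,
    replace_prob: float,
    rng: Optional[random.Random] = None,
) -> List[Vector]:
    """Return a copy of the row with some column embeddings replaced by mask_token.

    Each column is replaced independently with probability replace_prob
    (a hypothetical parameter, not taken from the paper).
    """
    if not 0.0 <= replace_prob <= 1.0:
        raise ValueError("replace_prob must be in [0, 1]")
    rng = rng or random.Random()
    augmented = []
    for embedding in column_embeddings:
        if rng.random() < replace_prob:
            augmented.append(list(mask_token))  # masked column
        else:
            augmented.append(list(embedding))   # column kept unchanged
    return augmented


# Toy row with three columns, each embedded as a 2-dimensional vector.
row = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
mask = [0.0, 0.0]
view = mask_token_replacement(row, mask, replace_prob=0.5, rng=random.Random(0))
print(view)
```

    Calling the function twice on the same row yields two different views, which is the role augmentation plays in the contrastive setting the abstract mentions.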

  • shutter

    Stochastic image generator for annotated synthetic datasets (by Rainelz)

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-09-07.

Index

What are some of the best open-source data-augmentation projects in Python? This list will help you:

Rank  Project             Stars
1     snorkel             5,701
2     TextAttack          2,744
3     torchio             1,949
4     webdataset          1,924
5     eda_nlp             1,536
6     fastdup             1,398
7     inltk               811
8     image_augmentor     434
9     synthcity           351
10    ModelNet40-C        200
11    ContraD             186
12    question_extractor  181
13    GAug                181
14    genius              175
15    mutate              149
16    KitanaQA            57
17    vkit                21
18    targetran           19
19    fastaugment         14
20    degradr             11
21    MTR                 9
22    shutter             1