Top 22 Python data-augmentation Projects
-
TextAttack
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP https://textattack.readthedocs.io/en/master/
-
webdataset
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
-
fastdup
fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset images and labels and reduce your data operations costs at an unparalleled scale.
-
inltk
Natural Language Toolkit for Indic Languages aims to provide out-of-the-box support for various NLP tasks that an application developer might need.
-
synthcity
A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.
-
ModelNet40-C
Repo for "Benchmarking Robustness of 3D Point Cloud Recognition against Common Corruptions" https://arxiv.org/abs/2201.12296
-
ContraD
Code for the paper "Training GANs with Stronger Augmentations via Contrastive Discriminator" (ICLR 2021)
-
KitanaQA
KitanaQA: Adversarial training and data augmentation for neural question-answering models (by searchableai)
-
targetran
Python library for data augmentation in object detection or image classification model training
-
fastaugment
A handy data augmentation toolkit for image classification put in a single efficient TensorFlow/PyTorch op.
-
MTR
The official implementation of the paper "Rethinking Data Augmentation for Tabular Data in Deep Learning" (by somaonishi)
Project mention: Preprocessing methods besides stop words, regular expressions, lemmatization and stemming for an NLP classification problem | /r/MLQuestions | 2023-06-09
Could have a look at what's available in the augmentor here: https://github.com/QData/TextAttack. I'm not experienced with NLP so I may be way off here.
Project mention: How to use data stored in a (private) S3 Bucket for training? | /r/pytorch | 2023-07-21
As an alternative, I've looked into using WebDataset, but couldn't figure out how to access data that is stored in a private bucket.
Visualizing your dataset (especially large ones) in a low-dimensional embedding space can tell you a lot about the patterns and clusters in your dataset.
We recently released a notebook showing how you can visualize your dataset using DINOv2 models by running it on your CPU.
Yes! No GPUs needed.
We used it to find clusters of similar images, duplicates, and outliers in a subset of the LAION dataset
Try it on your own dataset:
Colab notebook - https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/dinov2_notebook.ipynb
GitHub repo - https://github.com/visual-layer/fastdup
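The idea of finding duplicates from image embeddings can be illustrated with a small, library-free sketch. This is not fastdup's API or implementation: it assumes each image has already been turned into an embedding vector (for example by a DINOv2 model), and the function names, file names, toy vectors, and the 0.95 threshold are all hypothetical choices for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(embeddings, threshold=0.95):
    """Return (name_a, name_b, similarity) for every pair at or above the threshold.

    `embeddings` maps an image name to its embedding vector. The pairwise loop is
    quadratic in the number of images, so it only suits small collections.
    """
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        similarity = cosine_similarity(vec_a, vec_b)
        if similarity >= threshold:
            pairs.append((name_a, name_b, similarity))
    return pairs


# Hypothetical toy embeddings; real ones would come from an image model.
toy_embeddings = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_1_copy.jpg": [0.9, 0.1, 0.01],
    "car_7.jpg": [0.0, 0.2, 0.95],
}

for name_a, name_b, similarity in find_near_duplicates(toy_embeddings):
    print(f"{name_a} ~ {name_b} (cosine similarity {similarity:.4f})")
```

Outliers could be flagged the opposite way, as images whose best similarity to any other image stays low. For datasets at the scale the post describes, an approximate nearest-neighbor index would be needed instead of the all-pairs loop.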
Project mention: Show HN: Question Extractor: turn text into LLM finetuning data | news.ycombinator.com | 2023-04-19
(I do plan on making my results open source here, but it's obviously still a work in progress)
Project mention: Rethinking Data Augmentation for Tabular Data in Deep Learning | /r/BotNewsPreprints | 2023-05-18
Tabular data is the most widely used data format in machine learning (ML). While tree-based methods outperform DL-based methods in supervised learning, recent literature reports that self-supervised learning with Transformer-based models outperforms tree-based methods. In the existing literature on self-supervised learning for tabular data, contrastive learning is the predominant method. In contrastive learning, data augmentation is important to generate different views. However, data augmentation for tabular data has been difficult due to the unique structure and high complexity of tabular data. In addition, three main components are proposed together in existing methods: model structure, self-supervised learning methods, and data augmentation. Therefore, previous works have compared the performance without comprehensively considering these components, and it is not clear how each component affects the actual performance. In this study, we focus on data augmentation to address these issues. We propose a novel data augmentation method, Mask Token Replacement (MTR), which replaces the mask token with a portion of each tokenized column; MTR takes advantage of the properties of Transformer, which is becoming the predominant DL-based architecture for tabular data, to perform data augmentation for each column embedding. Through experiments with 13 diverse public datasets in both supervised and self-supervised learning scenarios, we show that MTR achieves competitive performance against existing data augmentation methods and improves model performance. In addition, we discuss specific scenarios in which MTR is most effective and identify the scope of its application. The code is available at https://github.com/somaonishi/MTR/.
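The augmentation described in the abstract can be approximated with a short sketch. It assumes that the method amounts to swapping a random subset of a row's column embeddings for one shared mask-token vector; that reading, the function name `mask_token_replacement`, and the `replace_prob` parameter are assumptions for illustration and are not taken from the MTR repository.

```python
import random


def mask_token_replacement(column_embeddings, mask_token, replace_prob=0.2, rng=None):
    """Return an augmented copy of one row's column embeddings.

    Each column embedding is independently replaced by `mask_token` with
    probability `replace_prob`. The input list is left unmodified.
    """
    rng = rng or random.Random()
    augmented = []
    for embedding in column_embeddings:
        if rng.random() < replace_prob:
            augmented.append(list(mask_token))  # masked column
        else:
            augmented.append(list(embedding))   # column kept as-is
    return augmented


# Hypothetical row with three tokenized columns, each embedded in 2 dimensions.
row = [[0.5, 1.0], [2.0, -1.0], [0.0, 3.0]]
mask = [0.0, 0.0]  # in a real model this would be a learned vector

view_a = mask_token_replacement(row, mask, replace_prob=0.5, rng=random.Random(0))
view_b = mask_token_replacement(row, mask, replace_prob=0.5, rng=random.Random(1))
print(view_a)
print(view_b)
```

Calling the function twice on the same row yields two differently masked views, which is the kind of input pair that the contrastive learning setup mentioned in the abstract relies on.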
Python data-augmentation related posts
- How to use data stored in a (private) S3 Bucket for training?
- [D] Title: Best tools and frameworks for working with million-billion image datasets?
- [D] Training networks on extremely large datasets (10+TB)?
- Best language model for filling multiple related masks [D]
- Hi everyone, my first Reddit post, let me introduce the GENIUS model.
- [D] Efficiently loading videos in PyTorch without extracting frames
- New image augmentation library for TF Dataset + TPU
Index
What are some of the best open-source data-augmentation projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | snorkel | 5,701 |
2 | TextAttack | 2,744 |
3 | torchio | 1,949 |
4 | webdataset | 1,924 |
5 | eda_nlp | 1,536 |
6 | fastdup | 1,398 |
7 | inltk | 811 |
8 | image_augmentor | 434 |
9 | synthcity | 351 |
10 | ModelNet40-C | 200 |
11 | ContraD | 186 |
12 | question_extractor | 181 |
13 | GAug | 181 |
14 | genius | 175 |
15 | mutate | 149 |
16 | KitanaQA | 57 |
17 | vkit | 21 |
18 | targetran | 19 |
19 | fastaugment | 14 |
20 | degradr | 11 |
21 | MTR | 9 |
22 | shutter | 1 |