A Visual Survey of Data Augmentation in NLP

This page summarizes the projects mentioned and recommended in the original post on dev.to

  • nlpaug

    Data augmentation for NLP

  • Spelling error injection: In this method, we add spelling errors to some random words in the sentence. These errors can be added programmatically or by using a mapping of common spelling errors, such as this list for English.
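The two options mentioned above (a mapping of common misspellings, and programmatic errors) can be sketched in plain Python. This is a minimal illustration, not nlpaug's implementation: the `COMMON_MISSPELLINGS` table and the function names are hypothetical, and the programmatic fallback is assumed to be a simple swap of two adjacent characters.

```python
import random

# Hypothetical mapping of common English misspellings, for illustration only.
COMMON_MISSPELLINGS = {
    "because": ["becuase", "becasue"],
    "definitely": ["definately"],
    "received": ["recieved"],
    "separate": ["seperate"],
}


def swap_adjacent_chars(word, rng):
    """Programmatic fallback: swap two neighbouring characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def inject_spelling_errors(sentence, n_errors=1, seed=None):
    """Misspell up to n_errors randomly chosen words of the sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    if not words:
        return sentence
    # Pick distinct word positions at random.
    for idx in rng.sample(range(len(words)), k=min(n_errors, len(words))):
        options = COMMON_MISSPELLINGS.get(words[idx].lower())
        if options:
            # Prefer a known common misspelling when the mapping has one.
            words[idx] = rng.choice(options)
        else:
            words[idx] = swap_adjacent_chars(words[idx], rng)
    return " ".join(words)


print(inject_spelling_errors("I definitely received the package", n_errors=2, seed=42))
```

Note that swapping two identical neighbouring letters (for example the "ee" in "see") leaves the word unchanged, so a sentence may occasionally come back with fewer errors than requested.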

  • uda

    Unsupervised Data Augmentation (UDA)

  • The words that replace the original word are chosen by calculating TF-IDF scores of words over the whole document and taking the lowest ones. You can refer to the code implementation for this in the original paper here.
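As a rough sketch of this idea, the snippet below scores words with TF-IDF and uses the lowest-scoring ones as the replacement pool. It is a simplified illustration under stated assumptions, not the UDA code: term frequency is counted over the whole corpus, tokens are split on whitespace, the lowest-scoring word in the sentence is the one replaced, and the names `tfidf_scores`, `tfidf_replace`, `n_replace` and `pool_size` are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_scores(docs):
    """Return {word: TF-IDF}, with term frequency counted over the whole corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    tf = Counter(w for toks in tokenized for w in toks)
    df = Counter(w for toks in tokenized for w in set(toks))
    n_docs = len(tokenized)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}


def tfidf_replace(sentence, scores, n_replace=1, pool_size=3, seed=None):
    """Replace the least informative words with other low-scoring words."""
    rng = random.Random(seed)
    # Replacement pool: the lowest-scoring words (ties broken alphabetically).
    pool = sorted(scores, key=lambda w: (scores[w], w))[:pool_size]
    words = sentence.split()
    # Words missing from the corpus are treated as informative and left alone.
    known = [i for i, w in enumerate(words) if w.lower() in scores]
    targets = sorted(known, key=lambda i: scores[words[i].lower()])[:n_replace]
    for i in targets:
        candidates = [w for w in pool if w != words[i].lower()]
        if candidates:
            words[i] = rng.choice(candidates)
    return " ".join(words)


docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "the bird sang a song",
]
scores = tfidf_scores(docs)
print(tfidf_replace("the cat sat", scores, seed=0))
```

In this toy corpus "the" appears in every document, so its score is 0 and it is the word that gets replaced, while the content words "cat" and "sat" are kept.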

  • contractions

    Fixes contractions such as `you're` to `you are`

  • In the paper, the author gives an example of transforming verbal forms from contraction to expansion and vice versa. We can generate augmented texts by applying this transformation. Because the transformation should not change the meaning of the sentence, it can fail when expanding ambiguous verbal forms. To resolve this, the paper proposes that we allow ambiguous contractions but skip ambiguous expansions. You can find a list of contractions for the English language here. For expansion, you can use the contractions library in Python.
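The rule of allowing ambiguous contractions while skipping ambiguous expansions can be sketched as follows. This is a minimal illustration with a deliberately tiny, hypothetical mapping, and it does not use the contractions library; in practice that library's `fix` function could take over the expansion direction. The function names here are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny mappings for illustration only.
# Expansion covers only forms with a single reading; "she's" could mean
# "she is" or "she has", so it is left out and never expanded.
UNAMBIGUOUS_EXPANSIONS = {
    "you're": "you are",
    "i'm": "I am",
    "can't": "cannot",
    "won't": "will not",
}

# Contraction is allowed even when the result is ambiguous to expand again.
CONTRACTIONS = {
    "you are": "you're",
    "she is": "she's",
    "it is": "it's",
    "i am": "I'm",
    "will not": "won't",
}


def _replace(text, mapping):
    # Longest keys first so longer phrases win over shorter overlapping ones.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b", re.IGNORECASE
    )

    def repl(match):
        out = mapping[match.group(0).lower()]
        # Keep a leading capital, e.g. at the start of a sentence.
        return out[0].upper() + out[1:] if match.group(0)[0].isupper() else out

    return pattern.sub(repl, text)


def expand_contractions(text):
    """Expand only contractions that have a single possible reading."""
    return _replace(text, UNAMBIGUOUS_EXPANSIONS)


def contract_forms(text):
    """Contract full forms, including ones that become ambiguous."""
    return _replace(text, CONTRACTIONS)


print(expand_contractions("She's sure you're right"))  # She's sure you are right
print(contract_forms("She is sure you are right"))     # She's sure you're right
```

The sketch matches straight apostrophes only and ignores grammatical context, so a form such as a sentence-final "she is" would still be contracted; a fuller version would need to check the surrounding words.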
