trankit
lemmatization-lists
Metric | trankit | lemmatization-lists
---|---|---
Mentions | 1 | 3
Stars | 707 | 303
Growth | - | -
Activity | 5.7 | 0.0
Last commit | 15 days ago | over 2 years ago
Language | Python |
License | Apache License 2.0 | ODC Open Database License v1.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
trankit
- Trankit v1.0.0 - An open-source Transformer-based Multilingual NLP Toolkit for 56 languages is out.
Trankit is written in Python and can be easily installed via pip. Our code and pretrained models are publicly available at: https://github.com/nlp-uoregon/trankit
lemmatization-lists
- Ambiguous spellings
It's a bit of a massive undertaking maintaining such a data set, so it's mostly taken from https://github.com/michmech/lemmatization-lists. At the top of the file you'll see some additional entries I've added to deal with personal pronouns and numbers.
- Is there a text list of words and their variations?
Another one to add to your list: https://github.com/michmech/lemmatization-lists
- Trying to build a lemmatizer from scratch
One approach might be to take a lemmatization list, like the lemma-token lists at https://github.com/michmech/lemmatization-lists/, and compile it into a Finite State Transducer. The Helsinki FST package, for instance, has an hfst-strings2fst command to compile pairs of strings into a transducer. You might need to do some reformatting of the input first.
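The "reformatting of the input" step mentioned above could look roughly like the sketch below. It assumes, without having verified either point here, that the lemmatization-lists files hold one tab-separated `lemma<TAB>form` pair per line and that the FST compiler accepts `input:output` pair strings, one per line. The function name `to_pair_lines` and the `sep` parameter are hypothetical, chosen for illustration; check the separator your `hfst-strings2fst` build expects in its help text before relying on this.

```python
def to_pair_lines(lines, sep=":"):
    """Turn 'lemma<TAB>form' lines into 'form<sep>lemma' pair strings.

    Assumption: the transducer should map an inflected form (input side)
    to its lemma (output side), so the two columns are swapped.
    """
    pairs = []
    for raw in lines:
        # Drop a possible byte-order mark and the trailing newline.
        line = raw.lstrip("\ufeff").rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip lines that are not a two-column pair
        lemma, form = (p.strip() for p in parts)
        # Entries containing the separator are skipped rather than escaped,
        # because the compiler's escaping rules are not assumed here.
        if not lemma or not form or sep in lemma or sep in form:
            continue
        pairs.append(f"{form}{sep}{lemma}")
    return pairs


# Small in-memory sample standing in for a real list file.
sample = ["be\tam\n", "be\tis\n", "\n", "go\twent\n", "bad line\n"]
result = to_pair_lines(sample)
print("\n".join(result))
```

For a real file, the same function can be fed an open file handle (for example `open(path, encoding="utf-8")`) and its result written out one pair per line as the compiler's input. Skipping entries that contain the separator loses a few pairs but avoids guessing at escape syntax.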
What are some alternatives?
spaCy - 💫 Industrial-strength Natural Language Processing (NLP) in Python
tldr-transformers - The "tl;dr" on a few notable transformer papers (pre-2022).
Stanza - Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
awesome-sentiment-analysis - Repository with everything necessary for sentiment analysis and related areas
transformers - 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
thesaurus - Offline database of synonyms/thesaurus
argilla - Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.
Awesome-pytorch-list - A comprehensive list of PyTorch-related content on GitHub, such as different models, implementations, helper libraries, tutorials, etc.
wiktextract - Wiktionary dump file parser and multilingual data extractor
flair - A very simple framework for state-of-the-art Natural Language Processing (NLP)
Sentimentanalysis - Language independent sentiment analysis
quantulum3 - Library for unit extraction - fork of quantulum for python3