Top 23 Python NLP Projects
-
transformers
🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.
Project mention: Retrieval Augmented Generation with Huggingface Transformers and Ray | reddit.com/r/deeplearning | 2021-02-10
Improving the scalability of RAG distributed fine-tuning
-
spaCy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Project mention: Ask HN: What is your production ML stack like? (2021) | news.ycombinator.com | 2021-02-08
Here's the ML stack I have been using for my last project:
- Doing NLP with spaCy (https://spacy.io/) as I consider it to be the most production ready framework for NLP
- Annotating datasets with Prodigy (https://prodi.gy/), a paid tool made by the spaCy team
- Deploying the trained spaCy models onto NLP Cloud (https://nlpcloud.io)
- Using the models through the NLP Cloud API in production to enrich my Django application
-
gensim
Topic Modelling for Humans
Project mention: Koan: A word2vec negative sampling implementation with correct CBOW update | news.ycombinator.com | 2021-01-02
Apparently it did: https://github.com/RaRe-Technologies/gensim/issues/1873
-
allennlp
An open-source NLP research library, built on PyTorch.
-
NLTK
NLTK Source
-
bert-as-service
Mapping a variable-length sentence to a fixed-length vector using BERT model
Project mention: Needed 100% to pass a safety quiz, need to wait a week to retake | reddit.com/r/mildlyinfuriating | 2021-01-12
You joke but
-
datasets
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools
This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.
-
Stanza
Official Stanford NLP Python Library for Many Human Languages
-
mycroft-core
Mycroft Core, the Mycroft Artificial Intelligence platform.
Mycroft claims to be an open source customizable voice assistant. I've never used it so do your own research. If what they say is true, it sounds like it would work.
-
best-of-ml-python
🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.
Project mention: best-of-python: A ranked list of awesome Python libraries and tools | reddit.com/r/Python | 2021-01-14
Here ya go: https://github.com/ml-tooling/best-of-ml-python/pull/47
-
GPT2-Chinese
Chinese version of GPT2 training code, using BERT tokenizer.
-
sumy
Module for automatic summarization of text documents and HTML pages.
-
jina
An easier way to build neural search on the cloud
Project mention: Show HN: Jina – Open-source AI framework to build search for anything, fast | news.ycombinator.com | 2021-02-10
-
thinc
🔮 A refreshing functional take on deep learning, compatible with your favorite libraries
Project mention: thinc - A refreshing functional take on deep learning, compatible with your favorite libraries | reddit.com/r/datascience | 2021-02-17
-
lingvo
Lingvo
Yes, there are really good open source speech to text tools (automatic speech recognition (ASR) is the common name for that).
Kaldi (https://kaldi-asr.org/) is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.
Lingvo (https://github.com/tensorflow/lingvo) is the open source version of Google speech recognition toolkit, with support mostly for end-to-end models.
ESPNet (https://github.com/espnet/espnet) is good and well known for end-to-end models as well.
RASR (https://github.com/rwth-i6/rasr) + RETURNN (https://github.com/rwth-i6/returnn) are very good as well, both for end-to-end models and hybrid NN-HMM, but they are for non-commercial applications only (or you need a commercial licence) (disclaimer: I work at the university chair which develops these frameworks).
-
PyTorch-NLP
Basic Utilities for PyTorch Natural Language Processing (NLP)
-
aeneas
aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
Project mention: Show HN: A retrainable subtitle synchronizer you can now build your own | news.ycombinator.com | 2021-01-31
Here's another solution: https://github.com/readbeyond/aeneas
-
uda
Unsupervised Data Augmentation (UDA)
The words that replace the original words are chosen by calculating TF-IDF scores of words over the whole document and taking the lowest-scoring ones. See the code implementation accompanying the original paper for details.
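To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF-guided word replacement. It is not the UDA implementation: the toy corpus, the function names `tfidf_scores` and `replace_low_tfidf`, the pool size of three, and the choice to also overwrite the lowest-scoring positions are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores


def replace_low_tfidf(doc, doc_scores, pool, n_replace, rng):
    """Overwrite the n_replace lowest-scoring positions with words from pool."""
    # Rank token positions by the score of the word at each position.
    ranked = sorted(range(len(doc)), key=lambda i: doc_scores[doc[i]])
    augmented = list(doc)
    for i in ranked[:n_replace]:
        augmented[i] = rng.choice(pool)
    return augmented


# Toy corpus (hypothetical data, whitespace-tokenized).
corpus = [
    "the movie was a great thriller".split(),
    "the food was a bit bland".split(),
    "the hotel was a short walk from the beach".split(),
]
scores = tfidf_scores(corpus)

# Corpus-level score per word: its highest TF-IDF in any document.
corpus_score = {}
for doc_scores in scores:
    for word, score in doc_scores.items():
        corpus_score[word] = max(score, corpus_score.get(word, 0.0))

# Replacement pool: the three lowest-scoring words in the corpus.
pool = sorted(corpus_score, key=corpus_score.get)[:3]

rng = random.Random(0)  # seeded so the augmentation is repeatable
augmented = replace_low_tfidf(corpus[0], scores[0], pool, n_replace=2, rng=rng)
print(" ".join(augmented))
```

Because words that occur in every document get an IDF of zero, the pool here ends up holding only the function words shared by all three sentences, so the informative words ("movie", "great", "thriller") pass through untouched.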
-
textacy
NLP, before and after spaCy
-
haystack
🔍 End-to-end Python framework for building natural language search interfaces to data. Leverages Transformers and the State-of-the-Art of NLP. Supports DPR, Elasticsearch, HuggingFace’s Modelhub, and much more! (by deepset-ai)
-
magnitude
A fast, efficient universal vector embedding utility package.
General language models from pymagnitude
-
TextAttack
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP
Libraries like nlpaug and textattack provide simple and consistent API to apply the above NLP data augmentation methods in Python. They are framework agnostic and can be easily integrated into your pipeline.
-
jiant
jiant is an NLP toolkit
Project mention: Looking for a code base to implement multi-task learning in NLP | reddit.com/r/LanguageTechnology | 2021-02-22
Jiant should fulfill 1, 2, 4 and 5.
Index
What are some of the best open-source NLP projects in Python? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | transformers | 41,393 |
2 | spaCy | 19,619 |
3 | gensim | 11,750 |
4 | allennlp | 9,712 |
5 | NLTK | 9,645 |
6 | bert-as-service | 8,904 |
7 | datasets | 6,802 |
8 | Stanza | 5,200 |
9 | mycroft-core | 4,918 |
10 | best-of-ml-python | 4,148 |
11 | GPT2-Chinese | 3,642 |
12 | sumy | 2,500 |
13 | jina | 2,364 |
14 | thinc | 2,199 |
15 | lingvo | 2,198 |
16 | PyTorch-NLP | 1,867 |
17 | aeneas | 1,825 |
18 | uda | 1,624 |
19 | textacy | 1,608 |
20 | haystack | 1,419 |
21 | magnitude | 1,386 |
22 | TextAttack | 1,263 |
23 | jiant | 1,124 |