Top 23 Python NLP Projects
🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.
Project mention: Retrieval Augmented Generation with Huggingface Transformers and Ray | reddit.com/r/deeplearning | 2021-02-10
Improving the scalability of RAG distributed fine-tuning
💫 Industrial-strength Natural Language Processing (NLP) in Python
Project mention: Ask HN: What is your production ML stack like? (2021) | news.ycombinator.com | 2021-02-08
Here's the ML stack I have been using for my last project:
- Doing NLP with spaCy (https://spacy.io/) as I consider it to be the most production-ready framework for NLP
- Annotating datasets with Prodigy (https://prodi.gy/), a paid tool made by the spaCy team
- Deploying the trained spaCy models onto NLP Cloud (https://nlpcloud.io)
- Using the models through the NLP Cloud API in production to enrich my Django application
Topic Modelling for Humans
Project mention: Koan: A word2vec negative sampling implementation with correct CBOW update | news.ycombinator.com | 2021-01-02
Apparently it did: https://github.com/RaRe-Technologies/gensim/issues/1873
An open-source NLP research library, built on PyTorch.
Project mention: AllenNLP v2.0.0 | news.ycombinator.com | 2021-01-27
NLTK Source
Project mention: Wordnet and Sexism | reddit.com/r/datascience | 2021-01-03
Mapping a variable-length sentence to a fixed-length vector using a BERT model
Project mention: Needed 100% to pass a safety quiz, need to wait a week to retake | reddit.com/r/mildlyinfuriating | 2021-01-12
You joke but
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Project mention: Build an Embeddings index with Hugging Face Datasets | dev.to | 2021-01-28
This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.
Official Stanford NLP Python Library for Many Human Languages
Mycroft Core, the Mycroft Artificial Intelligence platform.
Project mention: I want my Navi to greet me | reddit.com/r/Lain | 2021-02-23
Mycroft claims to be an open source customizable voice assistant. I've never used it so do your own research. If what they say is true, it sounds like it would work.
🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.
Project mention: best-of-python: A ranked list of awesome Python libraries and tools | reddit.com/r/Python | 2021-01-14
Here ya go: https://github.com/ml-tooling/best-of-ml-python/pull/47
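Chinese version of GPT2 training code, using BERT tokenizer.
Project mention: Could the mainland gradually require all residents and businesses to periodically study Xi's speeches and news commentary, and to submit written summaries of their thinking? | reddit.com/r/China_irl | 2021-02-15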
Module for automatic summarization of text documents and HTML pages.
An easier way to build neural search on the cloud
Project mention: Show HN: Jina – Open-source AI framework to build search for anything, fast | news.ycombinator.com | 2021-02-10
🔮 A refreshing functional take on deep learning, compatible with your favorite libraries
Project mention: thinc - A refreshing functional take on deep learning, compatible with your favorite libraries | reddit.com/r/datascience | 2021-02-17
Lingvo
Project mention: Don’t Share That. Yet | news.ycombinator.com | 2021-01-05
Yes, there are really good open source speech to text tools (automatic speech recognition (ASR) is the common name for that).
Kaldi (https://kaldi-asr.org/) is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.
Lingvo (https://github.com/tensorflow/lingvo) is the open source version of Google speech recognition toolkit, with support mostly for end-to-end models.
ESPNet (https://github.com/espnet/espnet) is good and well known for end-to-end models as well.
RASR (https://github.com/rwth-i6/rasr) + RETURNN (https://github.com/rwth-i6/returnn) are very good as well, both for end-to-end models and hybrid NN-HMM, but they are for non-commercial applications only (or you need a commercial licence) (disclaimer: I work at the university chair which develops these frameworks).
Basic Utilities for PyTorch Natural Language Processing (NLP)
aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
Unsupervised Data Augmentation (UDA)
Project mention: A Visual Survey of Data Augmentation in NLP | dev.to | 2020-08-26
The words that replace the original word are chosen by calculating TF-IDF scores of words over the whole document and taking the lowest ones. You can refer to the code implementation released with the original paper.
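The TF-IDF-based replacement described above can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the UDA authors' code: the function names, the toy corpus, the pool size, and the choice to overwrite the lowest-scoring words in a sentence are all hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each word occurs in.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores


def low_tfidf_pool(all_scores, size):
    """Collect the `size` words with the lowest TF-IDF score in the corpus."""
    lowest = {}
    for doc_scores in all_scores:
        for word, score in doc_scores.items():
            lowest[word] = min(score, lowest.get(word, score))
    # Ties are broken alphabetically so the pool is deterministic.
    return sorted(lowest, key=lambda w: (lowest[w], w))[:size]


def replace_low_tfidf(doc, doc_scores, pool, n_replace=1, seed=0):
    """Overwrite the n_replace lowest-scoring words in doc with pool words.

    Assumption: the words being replaced are also the low-scoring ones.
    """
    rng = random.Random(seed)
    # Positions sorted by ascending score; ties keep their original order.
    positions = sorted(range(len(doc)), key=lambda i: doc_scores[doc[i]])
    augmented = list(doc)
    for i in positions[:n_replace]:
        augmented[i] = rng.choice(pool)
    return augmented


# Toy corpus, purely illustrative.
docs = [
    "the movie was a great thriller".split(),
    "the book was a slow read".split(),
    "the game was a tense final".split(),
]
scores = tfidf_scores(docs)
pool = low_tfidf_pool(scores, size=3)
augmented = replace_low_tfidf(docs[0], scores[0], pool, n_replace=2)
print(pool)
print(" ".join(augmented))
```

Words that appear in every document get a score of zero here, so they end up both in the replacement pool and first in line to be replaced, while rarer content words such as "movie" are left untouched.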
NLP, before and after spaCy
🔍 End-to-end Python framework for building natural language search interfaces to data. Leverages Transformers and the State-of-the-Art of NLP. Supports DPR, Elasticsearch, HuggingFace’s Modelhub, and much more! (by deepset-ai)
Project mention: Recommendations for Semantic Search | reddit.com/r/LanguageTechnology | 2021-01-13
A fast, efficient universal vector embedding utility package.
Project mention: Build an Embeddings index from a data source | dev.to | 2021-02-17
General language models from pymagnitude
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP
Project mention: A Visual Survey of Data Augmentation in NLP | dev.to | 2020-08-26
Libraries like nlpaug and textattack provide a simple and consistent API for applying the above NLP data augmentation methods in Python. They are framework-agnostic and can be easily integrated into your pipeline.
jiant is an NLP toolkit
Project mention: Looking for a code base to implement multi-task learning in NLP | reddit.com/r/LanguageTechnology | 2021-02-22
Jiant should fulfill 1, 2, 4 and 5.