Python NLP

Open-source Python projects categorized as NLP

Top 23 Python NLP Projects

  • GitHub repo transformers

    🤗Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.

    Project mention: Retrieval Augmented Generation with Huggingface Transformers and Ray | 2021-02-10

    Improving the scalability of RAG distributed fine-tuning

  • GitHub repo spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Ask HN: What is your production ML stack like? (2021) | 2021-02-08

    Here's the ML stack I have been using for my last project:

    - Doing NLP with spaCy, as I consider it to be the most production-ready framework for NLP

    - Annotating datasets with Prodigy, a paid tool made by the spaCy team

    - Deploying the trained spaCy models onto NLP Cloud

    - Using the models through the NLP Cloud API in production to enrich my Django application

  • GitHub repo gensim

    Topic Modelling for Humans

    Project mention: Koan: A word2vec negative sampling implementation with correct CBOW update | 2021-01-02

    Apparently it did:

  • GitHub repo allennlp

    An open-source NLP research library, built on PyTorch.

    Project mention: AllenNLP v2.0.0 | 2021-01-27

  • GitHub repo NLTK

    NLTK Source

    Project mention: Wordnet and Sexism | 2021-01-03

  • GitHub repo bert-as-service

    Mapping a variable-length sentence to a fixed-length vector using BERT model

    Project mention: Needed 100% to pass a safety quiz, need to wait a week to retake | 2021-01-12

    You joke but

  • GitHub repo datasets

    🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

    Project mention: Build an Embeddings index with Hugging Face Datasets | 2021-01-28

    This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.

  • GitHub repo Stanza

    Official Stanford NLP Python Library for Many Human Languages

  • GitHub repo mycroft-core

    Mycroft Core, the Mycroft Artificial Intelligence platform.

    Project mention: I want my Navi to greet me | 2021-02-23

    Mycroft claims to be an open source customizable voice assistant. I've never used it so do your own research. If what they say is true, it sounds like it would work.

  • GitHub repo best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

    Project mention: best-of-python: A ranked list of awesome Python libraries and tools | 2021-01-14

    Here ya go:

  • GitHub repo GPT2-Chinese

    Chinese version of GPT2 training code, using BERT tokenizer.

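    Project mention: Could the mainland gradually require all residents and businesses to periodically study Xi's speeches and news commentaries and submit summaries of their thinking? | 2021-02-15

  • GitHub repo sumy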

    Module for automatic summarization of text documents and HTML pages.

  • GitHub repo jina

    An easier way to build neural search on the cloud

    Project mention: Show HN: Jina – Open-source AI framework to build search for anything, fast | 2021-02-10

  • GitHub repo thinc

    🔮 A refreshing functional take on deep learning, compatible with your favorite libraries

    Project mention: thinc - A refreshing functional take on deep learning, compatible with your favorite libraries | 2021-02-17

  • GitHub repo lingvo

    Project mention: Don’t Share That. Yet | 2021-01-05

    Yes, there are really good open source speech-to-text tools (automatic speech recognition, or ASR, is the common name for that).

    Kaldi is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.

    Lingvo is the open source version of Google's speech recognition toolkit, with support mostly for end-to-end models.

    ESPNet is good and well known for end-to-end models as well.

    RASR + RETURNN are very good as well, both for end-to-end models and hybrid NN-HMM, but they are for non-commercial applications only (or you need a commercial licence) (disclaimer: I work at the university chair which develops these frameworks).

  • GitHub repo PyTorch-NLP

    Basic Utilities for PyTorch Natural Language Processing (NLP)

  • GitHub repo aeneas

    aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)

    Project mention: Show HN: A retrainable subtitle synchronizer you can now build your own | 2021-01-31

    here's another solution:

  • GitHub repo uda

    Unsupervised Data Augmentation (UDA)

    Project mention: A Visual Survey of Data Augmentation in NLP | 2020-08-26

    The words that replace the original word are chosen by calculating TF-IDF scores of words over the whole document and taking the lowest ones. You can refer to the code implementation released with the original paper.

  • GitHub repo textacy

    NLP, before and after spaCy

  • GitHub repo haystack

    🔍 End-to-end Python framework for building natural language search interfaces to data. Leverages Transformers and the State-of-the-Art of NLP. Supports DPR, Elasticsearch, HuggingFace’s Modelhub, and much more! (by deepset-ai)

    Project mention: Recommendations for Semantic Search | 2021-01-13

  • GitHub repo magnitude

    A fast, efficient universal vector embedding utility package.

    Project mention: Build an Embeddings index from a data source | 2021-02-17

    General language models from pymagnitude

  • GitHub repo TextAttack

    TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

    Project mention: A Visual Survey of Data Augmentation in NLP | 2020-08-26

    Libraries like nlpaug and textattack provide a simple and consistent API to apply the above NLP data augmentation methods in Python. They are framework-agnostic and can be easily integrated into your pipeline.

  • GitHub repo jiant

    jiant is an NLP toolkit

    Project mention: Looking for a code base to implement multi-task learning in NLP | 2021-02-22

    Jiant should fulfill 1, 2, 4 and 5.
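
The TF-IDF selection step quoted in the uda mention above can be illustrated with a small standard-library sketch. It is not the code from the uda repository: the function names `tfidf_scores` and `lowest_scoring_words`, the toy corpus, and the unsmoothed `log(N / df)` weighting are all assumptions made for illustration. It only shows how the lowest-scoring words of one document could be picked out as candidates; how replacements are then sampled is left out.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {word: TF-IDF score} dict per tokenized document (hypothetical helper)."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            # Term frequency times unsmoothed inverse document frequency.
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def lowest_scoring_words(doc_scores, k):
    """Pick the k words with the lowest score; ties are broken alphabetically."""
    ranked = sorted(doc_scores.items(), key=lambda item: (item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Toy corpus, invented for this example.
corpus = [
    "the movie was a great movie".split(),
    "the plot was thin".split(),
    "the acting felt great".split(),
]
scores = tfidf_scores(corpus)
candidates = lowest_scoring_words(scores[0], k=2)
print(candidates)
```

Words that occur in every document, such as "the" here, get a score of zero and are selected first, which matches the intuition that uninformative words are the safest ones to touch during augmentation.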

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-02-23.


What are some of the best open-source NLP projects in Python? This list will help you:

Rank  Project             Stars
   1  transformers       41,393
   2  spaCy              19,619
   3  gensim             11,750
   4  allennlp            9,712
   5  NLTK                9,645
   6  bert-as-service     8,904
   7  datasets            6,802
   8  Stanza              5,200
   9  mycroft-core        4,918
  10  best-of-ml-python   4,148
  11  GPT2-Chinese        3,642
  12  sumy                2,500
  13  jina                2,364
  14  thinc               2,199
  15  lingvo              2,198
  16  PyTorch-NLP         1,867
  17  aeneas              1,825
  18  uda                 1,624
  19  textacy             1,608
  20  haystack            1,419
  21  magnitude           1,386
  22  TextAttack          1,263
  23  jiant               1,124