Python Natural Language Processing

Open-source Python projects categorized as Natural Language Processing

Top 23 Python Natural Language Processing Projects

  • GitHub repo transformers

    🤗Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.

    Project mention: Retrieval Augmented Generation with Huggingface Transformers and Ray | 2021-02-10

    Improving the scalability of RAG distributed fine-tuning

  • GitHub repo funNLP

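    Chinese and English sensitive words, language detection, domestic and international mobile/landline number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name databases, Chinese abbreviation database, character decomposition dictionary, word sentiment values, stop words, reactionary word list, violence and terrorism word list, traditional/simplified Chinese conversion, English approximation of Chinese pronunciation, Wang Feng lyrics generator, occupation name lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, segmentation of run-together English text, various Chinese word vectors, company name collection, classical poetry corpus, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automotive lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese Q&A dataset, collection of sentence similarity matching algorithms, BERT resources, text generation and summarization tools, cocoNLP information extraction tool, regex matching for Chinese domestic phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University AI technology report series, natural language generation, the "NLU is too hard" series, automatic couplet data and bot, username blacklist, criminal charge and legal terminology with classification models, WeChat official account corpus, CS224n deep learning for NLP course, Chinese handwritten character recognition, Chinese NLP corpora/datasets, variable naming tool, word segmentation corpus + code, English task-oriented dialogue dataset, ASR speech datasets + deep-learning-based Chinese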

  • GitHub repo Jieba


  • GitHub repo spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Ask HN: What is your production ML stack like? (2021) | 2021-02-08

    Here's the ML stack I have been using for my last project:

    - Doing NLP with spaCy, as I consider it to be the most production-ready framework for NLP

    - Annotating datasets with Prodigy, a paid tool made by the spaCy team

    - Deploying the trained spaCy models onto NLP Cloud

    - Using the models through the NLP Cloud API in production to enrich my Django application

  • GitHub repo NLP-progress

    Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

    Project mention: What are some classification tasks where BERT-based models don't work well? In a similar vein, what are some generative tasks where fine-tuning GPT-2/LM does not work well? | 2021-02-21

    One place to start is NLP-progress, if leaderboards are your thing. If the model at the top of a leaderboard is not transformer-based and one further down is, you have your answer.

  • GitHub repo gensim

    Topic Modelling for Humans

    Project mention: Koan: A word2vec negative sampling implementation with correct CBOW update | 2021-01-02

    Apparently it did:

  • GitHub repo allennlp

    An open-source NLP research library, built on PyTorch.

    Project mention: AllenNLP v2.0.0 | 2021-01-27

  • GitHub repo NLTK

    NLTK Source

    Project mention: Wordnet and Sexism | 2021-01-03

  • GitHub repo bert-as-service

    Mapping a variable-length sentence to a fixed-length vector using BERT model

    Project mention: Needed 100% to pass a safety quiz, need to wait a week to retake | 2021-01-12

    You joke but

  • GitHub repo Pattern

    Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

  • GitHub repo datasets

    🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

    Project mention: Build an Embeddings index with Hugging Face Datasets | 2021-01-28

    This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.

  • GitHub repo pytext

    A natural language modeling framework based on PyTorch

  • GitHub repo SnowNLP

    Python library for processing Chinese text

  • GitHub repo pkuseg-python

    The pkuseg toolkit for multi-domain Chinese word segmentation

  • GitHub repo Stanza

    Official Stanford NLP Python Library for Many Human Languages

  • GitHub repo mycroft-core

    Mycroft Core, the Mycroft Artificial Intelligence platform.

    Project mention: I want my Navi to greet me | 2021-02-23

    Mycroft claims to be an open source customizable voice assistant. I've never used it so do your own research. If what they say is true, it sounds like it would work.

  • GitHub repo doccano

    Open source text annotation tool for machine learning practitioners.

  • GitHub repo thinc

    🔮 A refreshing functional take on deep learning, compatible with your favorite libraries

    Project mention: thinc - A refreshing functional take on deep learning, compatible with your favorite libraries | 2021-02-17

  • GitHub repo PyTorch-NLP

    Basic Utilities for PyTorch Natural Language Processing (NLP)

  • GitHub repo polyglot

    Multilingual text (NLP) processing toolkit

  • GitHub repo

    Stand-alone language identification system

  • GitHub repo uda

    Unsupervised Data Augmentation (UDA)

    Project mention: A Visual Survey of Data Augmentation in NLP | 2020-08-26

    The words that replace the original word are chosen by calculating TF-IDF scores of words over the whole document and taking the lowest ones. You can refer to the code implementation for this in the original paper here.
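
    The TF-IDF replacement idea quoted above can be sketched in a few lines of plain Python. This is a simplified illustration and not the UDA repository's actual code: the function names (`tfidf_scores`, `lowest_scoring_words`, `augment`), the corpus-level scoring formula, the toy corpus, and the rule that only low-scoring tokens get swapped are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def tfidf_scores(docs):
    """Score each word as (corpus term count) * log(n_docs / document frequency).

    Assumption: one corpus-level score per word, a simplification of TF-IDF.
    """
    n_docs = len(docs)
    term_freq = Counter(word for doc in docs for word in doc)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {
        word: term_freq[word] * math.log(n_docs / doc_freq[word])
        for word in term_freq
    }


def lowest_scoring_words(scores, k):
    """Return the k words with the lowest scores (ties broken alphabetically)."""
    return sorted(scores, key=lambda word: (scores[word], word))[:k]


def augment(tokens, pool, rng, p=0.5):
    """Swap each token that is in the low-score pool for another pool word.

    Tokens outside the pool are treated as informative and left unchanged.
    Each eligible token is replaced with probability p.
    """
    pool_set = set(pool)
    augmented = []
    for token in tokens:
        candidates = [word for word in pool if word != token]
        if token in pool_set and candidates and rng.random() < p:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(token)
    return augmented


# Hypothetical toy corpus: three tokenized "documents".
docs = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "dull"],
    ["the", "cast", "was", "fine"],
]
scores = tfidf_scores(docs)
pool = lowest_scoring_words(scores, k=2)
print(pool)
print(augment(docs[0], pool, random.Random(0), p=1.0))
```

    In this toy corpus, words that occur in every document score zero and form the replacement pool, while words unique to one document keep their place. A real pipeline would more likely rely on a library tokenizer and vectorizer than on hand-rolled counting.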

  • GitHub repo textacy

    NLP, before and after spaCy

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-02-23.


What are some of the best open-source Natural Language Processing projects in Python? This list will help you:

#   Project          Stars
1   transformers     41,393
2   funNLP           28,811
3   Jieba            25,553
4   spaCy            19,619
5   NLP-progress     17,810
6   gensim           11,750
7   allennlp         9,712
8   NLTK             9,645
9   bert-as-service  8,904
10  Pattern          7,803
11  datasets         6,802
12  pytext           6,131
13  SnowNLP          5,300
14  pkuseg-python    5,295
15  Stanza           5,200
16  mycroft-core     4,918
17  doccano          4,373
18  thinc            2,199
19  PyTorch-NLP      1,867
20  polyglot         1,773
21                   1,728
22  uda              1,624
23  textacy          1,608