Top 23 Python NLP Projects

transformers

173 124,557 10.0 Python

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Project mention: AI enthusiasm #6 - Finetune any LLM you want💡 | dev.to | 2024-04-16

Most of this tutorial is based on the Hugging Face course on Transformers and on Niels Rogge's Transformers tutorials. Make sure to check out their work and give them a star on GitHub, if you please ❤️

bert

49 36,945 0.0 Python

TensorFlow code and pre-trained models for BERT

Project mention: OpenAI – Application for US trademark "GPT" has failed | news.ycombinator.com | 2024-02-15

task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.
[0] https://arxiv.org/abs/1810.04805

HanLP

3 32,214 5.6 Python

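Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing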
spaCy

106 28,660 9.3 Python

💫 Industrial-strength Natural Language Processing (NLP) in Python

Project mention: Step by step guide to create customized chatbot by using spaCy (Python NLP library) | dev.to | 2024-03-23

Hi Community, in this article I will demonstrate the steps to create your own chatbot using spaCy (an open-source software library for advanced natural language processing, written in Python and Cython).

datasets

15 18,345 9.5 Python

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Project mention: 🐍🐍 23 issues to grow yourself as an exceptional open-source Python expert 🧑‍💻 🥇 | dev.to | 2023-10-19

unilm

40 18,262 9.0 Python

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Project mention: The Era of 1-Bit LLMs: Training Tips, Code and FAQ [pdf] | news.ycombinator.com | 2024-03-21

rasa

16 17,919 9.6 Python

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Project mention: 🔥🚀 Top 10 Open-Source Must-Have Tools for Crafting Your Own Chatbot 🤖💬 | dev.to | 2023-11-06

Chinese-LLaMA-Alpaca

4 17,140 8.8 Python

Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment

Project mention: Chinese-Alpaca-Plus-13B-GPTQ | /r/LocalLLaMA | 2023-05-30

I'd like to share with you today the Chinese-Alpaca-Plus-13B-GPTQ model, a GPTQ-format 4-bit quantised version of Yiming Cui's Chinese-LLaMA-Alpaca 13B for GPU inference.

best-of-ml-python

16 15,284 7.9 Python

🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.
gensim

18 15,212 7.5 Python

Topic Modelling for Humans

Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08

haystack

54 13,564 9.9 Python

🔍 LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Project mention: Release Radar • March 2024 Edition | dev.to | 2024-04-07

flair

9 13,558 9.4 Python

A very simple framework for state-of-the-art Natural Language Processing (NLP)
NLTK

64 12,999 8.3 Python

NLTK Source

Project mention: Building a local AI smart Home Assistant | news.ycombinator.com | 2024-01-13

Alternatively, could we not simply split on common characters such as newlines and periods to break the text into sentences? It would be fragile, with special handling required for numbers with decimal points and probably various other edge cases, though.
There are also Python libraries meant for natural language parsing[0] that could do that task for us. I even see examples on Stack Overflow[1] that simply split text into sentences.
[0]: https://www.nltk.org/
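
The naive approach described in the comment above can be sketched with the standard library alone. This is a minimal illustration, not NLTK's tokenizer: the function name `split_sentences` and the boundary rule are assumptions made for this example. It splits on newlines and on sentence-ending punctuation followed by whitespace and a capital letter, digit, or quote, which keeps decimal numbers such as 3.14 intact.

```python
import re

# Hypothetical boundary rule (an assumption for this sketch, not NLTK's logic):
# split at whitespace that follows '.', '!' or '?' and precedes an uppercase
# letter, digit, or quote character; also split on any run of newlines.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\'])|\n+')


def split_sentences(text: str) -> list[str]:
    """Split text into rough sentences using a regex heuristic."""
    parts = _SENTENCE_BOUNDARY.split(text)
    # Drop empty fragments and trim surrounding whitespace.
    return [part.strip() for part in parts if part and part.strip()]


if __name__ == "__main__":
    sample = "Pi is about 3.14. It is irrational!\nIs that all? Yes."
    for sentence in split_sentences(sample):
        print(sentence)
```

The heuristic is as fragile as the comment predicts: an abbreviation such as "Dr." followed by a capitalized name is treated as a sentence end, which is the kind of edge case a dedicated library is designed to handle.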

PaddleHub

9 12,488 1.9 Python

Awesome pre-trained models toolkit based on PaddlePaddle. (400+ models including Image, Text, Audio, Video and Cross-Modal with Easy Inference & Serving)
PaddleNLP

2 11,386 9.8 Python

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
TextBlob

4 8,917 6.1 Python

Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

Project mention: Using EvaDB to build AI-enhanced apps | dev.to | 2024-01-10

TextBlob is a Python toolkit for text processing. It offers some common NLP functionalities such as part-of-speech tagging and noun phrase extraction. We’ll use TextBlob in our project to perform some quick sentiment analysis on tweets.

petals

98 8,631 8.5 Python

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading

Project mention: Mistral Large | news.ycombinator.com | 2024-02-26

So how long until we can do an open source Mistral Large?
We could make a start on Petals or some other open source distributed training network cluster possibly?
[0] https://petals.dev/

attention-is-all-you-need-pytorch

3 8,409 0.0 Python

A PyTorch implementation of the Transformer model in "Attention is All You Need".

Project mention: ElevenLabs Launches Voice Translation Tool to Break Down Language Barriers | news.ycombinator.com | 2023-10-10

The transformer model was invented to attend to context over the entire sequence length. Look at how the original authors used the Transformer for NMT in the original Vaswani et al publication. https://github.com/jadore801120/attention-is-all-you-need-py...
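
To illustrate what "attending to context over the entire sequence" means, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation described in "Attention is All You Need". It is not code from the linked repository; the function names `softmax` and `attention` and the list-of-lists representation are assumptions chosen to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists.

    Every query is scored against every key, so each output row is a
    weighted mix of all value rows in the sequence.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot product of the query with each key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs
```

When all keys are identical, every position receives the same weight and the output is simply the average of the value vectors. Real implementations add learned projections, multiple heads, and masking, and run on tensors instead of lists.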

text-generation-inference

28 7,722 9.6 Python

Large Language Model Text Generation Inference

Project mention: Zephyr 141B, a Mixtral 8x22B fine-tune, is now available in Hugging Chat | news.ycombinator.com | 2024-04-12

I wanted to write that the TGI inference engine is no longer open source, but they have reverted the license back to Apache 2.0 for the new version, TGI v2.0: https://github.com/huggingface/text-generation-inference/rel...
Good news!

GPT2-Chinese

2 7,342 2.8 Python

Chinese version of GPT2 training code, using BERT tokenizer.
Stanza

8 7,043 9.7 Python

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages

Project mention: Down and Out in the Magic Kingdom | news.ycombinator.com | 2023-07-23

txtai

354 6,910 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Project mention: Build knowledge graphs with LLM-driven entity extraction | dev.to | 2024-02-21

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

mycroft-core

212 6,448 0.0 Python

Mycroft Core, the Mycroft Artificial Intelligence platform.

Project mention: Rabbit R1, Designed by Teenage Engineering | news.ycombinator.com | 2024-01-09

It's indeed suspicious. You're sending your voice samples, your various services accounts, your location and more private data to some proprietary black box in some public cloud. Sorry, but this is a privacy nightmare. It should be open source and self-hosted like Mycroft (https://mycroft.ai) or Leon (https://getleon.ai) to be trustworthy.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-16.

Python NLP related posts

AI enthusiasm #6 - Finetune any LLM you want💡
2 projects | dev.to | 16 Apr 2024
Fast and secure translation on your local machine with a GUI
6 projects | news.ycombinator.com | 13 Apr 2024
Zephyr 141B, a Mixtral 8x22B fine-tune, is now available in Hugging Chat
3 projects | news.ycombinator.com | 12 Apr 2024
PullRequestBenchmark Challenge: Can AI Replace Your Dev Team?
1 project | news.ycombinator.com | 10 Apr 2024
Hugging Face reverts the license back to Apache 2.0
1 project | news.ycombinator.com | 8 Apr 2024
HuggingFace text-generation-inference is reverting to Apache 2.0 License
2 projects | news.ycombinator.com | 8 Apr 2024
Schedule-Free Learning – A New Way to Train
3 projects | news.ycombinator.com | 6 Apr 2024

Index

What are some of the best open-source NLP projects in Python? This list will help you:

#	Project	Stars
1	transformers	124,557
2	bert	36,945
3	HanLP	32,214
4	spaCy	28,660
5	datasets	18,345
6	unilm	18,262
7	rasa	17,919
8	Chinese-LLaMA-Alpaca	17,140
9	best-of-ml-python	15,284
10	gensim	15,212
11	haystack	13,564
12	flair	13,558
13	NLTK	12,999
14	PaddleHub	12,488
15	PaddleNLP	11,386
16	TextBlob	8,917
17	petals	8,631
18	attention-is-all-you-need-pytorch	8,409
19	text-generation-inference	7,722
20	GPT2-Chinese	7,342
21	Stanza	7,043
22	txtai	6,910
23	mycroft-core	6,448