Top 23 Python Natural Language Processing Projects
-
transformers
Project mention: QwQ-32B: Embracing the Power of Reinforcement Learning | news.ycombinator.com | 2025-03-05
Huggingface's transformers library supports something similar to this. You set a minimum length, and until that length is reached, the end of sequence token has no chance of being output.
https://github.com/huggingface/transformers/blob/51ed61e2f05...
S1 does something similar to put a lower limit on its reasoning output. End of thinking is represented with the <|im_start|> token, followed by the word 'answer'. IIRC the code dynamically adds/removes <|im_start|> to the list of suppressed tokens.
Both of these approaches set the probability to zero, not something small like you were suggesting.
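The masking idea described in this comment can be sketched in a few lines of plain Python. This is an illustration only, not the code from transformers or S1: the function names, the toy three-token vocabulary, and the choice of index 2 as the end-of-sequence token are all assumptions made for the example. It shows why the probability becomes exactly zero rather than merely small: the logit is set to negative infinity before the softmax.

```python
import math

def suppress_eos_until_min_length(logits, cur_len, min_length, eos_token_id):
    """Return a copy of logits with the EOS logit masked while cur_len < min_length."""
    masked = list(logits)
    if cur_len < min_length:
        # -inf becomes exp(-inf) == 0.0 in the softmax, so EOS cannot be sampled.
        masked[eos_token_id] = -math.inf
    return masked

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary of 3 tokens; index 2 plays the role of EOS.
logits = [1.0, 2.0, 0.5]
early = softmax(suppress_eos_until_min_length(logits, cur_len=3, min_length=10, eos_token_id=2))
late = softmax(suppress_eos_until_min_length(logits, cur_len=10, min_length=10, eos_token_id=2))
print(early[2], late[2])
```

The S1-style variant mentioned above would presumably apply the same mask to a different token id, adding it to or removing it from the suppressed set as generation proceeds.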
-
funNLP
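Chinese and English sensitive words, language detection, Chinese and foreign mobile/landline number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name databases, Chinese abbreviation database, character decomposition dictionary, word sentiment values, stop words, reactionary word list, violence and terrorism word list, traditional/simplified Chinese conversion, English imitation of Chinese pronunciation, Wang Feng lyrics generator, job title lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, continuous English segmentation, various Chinese word vectors, company name collection, classical poetry database, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automotive lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese Q&A dataset, collection of sentence similarity matching algorithms, BERT resources, text generation and summarization tools, cocoNLP information extraction tool, regex matching for domestic phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University AI technology report series, natural language generation, the "NLU is too hard" series, automatic couplet data and bot, username blacklist, criminal charge and legal terms with classification model, WeChat official account corpus, cs224n deep learning NLP course, Chinese handwritten character recognition, Chinese NLP corpora/datasets, variable naming tool, word segmentation corpus + code, task-oriented dialogue English datasets, ASR speech datasets + deep-learning-based Chinese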
-
bert
Project mention: A Novel Approach for Text Encryption Using Tokenizers in Ruby | dev.to | 2025-02-06
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
-
HanLP
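Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing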
-
Jieba
Project mention: Show HN: Mandarin Word Segmenter with Translation | news.ycombinator.com | 2025-02-04
Thanks for the kind words!
I'm using Jieba[0] because it hits a nice balance of fast and accurate. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation. For example, Jieba tends to split up chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.
[0] https://github.com/fxsjy/jieba
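One of the post-segmentation heuristics described above, re-joining chengyu that the segmenter split in two, might look roughly like the sketch below. It is not the commenter's code: `merge_known_entries`, the `max_span` limit, and the one-entry lexicon are hypothetical, and the token list simply stands in for whatever a segmenter such as Jieba returns.

```python
def merge_known_entries(tokens, lexicon, max_span=4):
    """Greedily merge runs of adjacent tokens whose concatenation is in lexicon."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest span first so the widest dictionary entry wins.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-token entry starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical segmenter output in which a four-character idiom was split in two.
tokens = ["他", "一马", "当先", "冲", "了", "出去"]
chengyu = {"一马当先"}
result = merge_known_entries(tokens, chengyu)
print(result)
```

Checking the longest span first is a design choice that favors whole dictionary entries over partial ones; a real pipeline would likely combine several such passes.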
-
spaCy
Project mention: SpaCy – Industrial-Strength Natural Language Processing in Python | news.ycombinator.com | 2025-02-09
-
crewAI
Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.
-
d2l-en
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
-
NLP-progress
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
-
datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Project mention: 20 Open Source Tools I Recommend to Build, Share, and Run AI Projects | dev.to | 2024-11-13
Datasets library repository for accessing and sharing datasets with the community.
-
rasa
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
Rasa GitHub Repository
-
Ciphey
⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡
-
Qwen
The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Project mention: Running Qwen, Nearly as Powerful as DeepSeek, on a MacBook Pro | dev.to | 2025-02-05
Qwen (Qwen GitHub Repository) has been gaining attention recently as a powerful open-source large language model (LLM). I decided to give it a spin on my MacBook Pro using Ollama, a platform designed for running local LLMs. While Qwen2.5-Max boasts the highest performance, my setup could only handle the smaller Qwen2.5 (32B) model. Here's what I found!
-
gensim
-
flair
Project mention: WhisperNER: Unified Open Named Entity and Speech Recognition | news.ycombinator.com | 2024-11-21
only the last string is a LOC named entity. Of course you can change definitions from the standard ones if you like, but then you should be careful not to compare with tools that use the original standard definition of NER such as flairNLP [1].
[1] https://github.com/flairNLP/flair?tab=readme-ov-file
-
NLTK
Project mention: Mastering the Art of Conversational AI: Insights and Implementations with Python | dev.to | 2025-02-12
We can use NLTK, a powerful library for Python that provides easy-to-use interfaces to over 50 corpora and lexical resources.
-
MOSS
-
ludwig
Project mention: Show HN: Toolkit for LLM Fine-Tuning, Ablating and Testing | news.ycombinator.com | 2024-04-07
This is a great project, a little bit similar to https://github.com/ludwig-ai/ludwig, but it includes testing capabilities and ablation.
Questions regarding the LLM testing aspect: How extensive is the test coverage for LLM use cases, and what is the current state of this project area? Do you offer any guarantees, or is it considered an open-ended problem?
Would love to see more progress toward this area!
-
LLMSurvey
Here’s another one - it’s older but has some interesting charts and graphs.
https://arxiv.org/abs/2303.18223
-
camel
🐫 CAMEL: Finding the Scaling Law of Agents. The first and the best multi-agent framework. https://www.camel-ai.org
These use cases show off CAMEL-AI’s knack for teamwork and flexibility. Whether you’re automating, researching, or assisting, it’s got something for you. Ready to try it? Hit up the GitHub repo or chat with us on Discord. What’s your first project gonna be? Let’s make it happen!
-
doccano
-
TextBlob
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
-
attention-is-all-you-need-pytorch
A PyTorch implementation of the Transformer model in "Attention is All You Need".
Python Natural Language Processing discussion
Python Natural Language Processing related posts
-
QwQ-32B: Embracing the Power of Reinforcement Learning
-
CrewAI – open-source framework for LLM agents
-
Mastering the Art of Conversational AI: Insights and Implementations with Python
-
SpaCy – Industrial-Strength Natural Language Processing in Python
-
A Novel Approach for Text Encryption Using Tokenizers in Ruby
-
Show HN: Mandarin Word Segmenter with Translation
-
Building an AI-powered Financial Behavior Analyzer with NodeJS, Python, SvelteKit, and TailwindCSS - Part 1: The AI Service
-
A note from our sponsor - SaaSHub
www.saashub.com | 21 Mar 2025
Index
What are some of the best open-source Natural Language Processing projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | transformers | 141,593 |
2 | funNLP | 71,798 |
3 | bert | 38,864 |
4 | HanLP | 34,609 |
5 | Jieba | 33,835 |
6 | spaCy | 31,139 |
7 | crewAI | 28,679 |
8 | d2l-en | 25,261 |
9 | NLP-progress | 22,817 |
10 | datasets | 19,793 |
11 | rasa | 19,683 |
12 | Ciphey | 18,839 |
13 | Qwen | 17,509 |
14 | gensim | 15,915 |
15 | flair | 14,108 |
16 | NLTK | 13,910 |
17 | MOSS | 12,033 |
18 | ludwig | 11,380 |
19 | LLMSurvey | 11,209 |
20 | camel | 10,760 |
21 | doccano | 9,849 |
22 | TextBlob | 9,285 |
23 | attention-is-all-you-need-pytorch | 9,067 |