Top 23 Python Natural Language Processing Projects
-
transformers
🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.
4. Repeat
For step 3 you need to send the gradients from each GPU somewhere, and then send back either the averaged gradient or the updated model weights. So when the model is large (say, 3GB for GPT 774M!) that's a lot of GPU-GPU communication!
You're right that for the vast majority of ML cases, the models are small enough that the synchronization cost is negligible, though.
I wrote up some benchmarks here:
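As a rough illustration of the synchronization step described in the comment above (this is not the benchmark it mentions), the sketch below simulates averaging gradients across GPUs in plain Python and estimates how many bytes one full set of gradients occupies. The function names `average_gradients` and `gradient_payload_bytes` are hypothetical, and the 4 bytes per parameter figure assumes float32 gradients, which the comment does not state.

```python
def average_gradients(per_gpu_grads):
    """Element-wise mean of the gradient vectors produced by each simulated GPU."""
    n_gpus = len(per_gpu_grads)
    # zip(*...) groups the i-th gradient entry from every GPU together.
    return [sum(values) / n_gpus for values in zip(*per_gpu_grads)]


def gradient_payload_bytes(num_params, bytes_per_param=4):
    """Size of one full gradient copy, assuming float32 (4 bytes) by default."""
    return num_params * bytes_per_param


# Two simulated GPUs, each holding gradients for the same two parameters.
grads = [[1.0, -2.0], [3.0, 0.0]]
print(average_gradients(grads))  # [2.0, -1.0]

# A 774M-parameter model under the float32 assumption.
print(f"{gradient_payload_bytes(774_000_000) / 1e9:.1f} GB")  # 3.1 GB
```

Under that assumption, the estimate lines up with the "3GB for GPT 774M" figure quoted above: every synchronization step has to move roughly that much data per GPU.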
-
funNLP
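Chinese and English sensitive-word lists, language detection, Chinese and international mobile/landline number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name databases, Chinese abbreviation database, character-decomposition dictionary, word sentiment scores, stop words, reactionary word list, violence and terrorism word list, Traditional/Simplified Chinese conversion, English approximation of Chinese pronunciation, Wang Feng lyrics generator, job title lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, segmentation of continuous English text, various Chinese word vectors, company name collection, classical poetry database, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automobile lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese Q&A dataset, collection of sentence similarity matching algorithms, BERT resources, text generation and summarization tools, cocoNLP information extraction tool, regex matching for Chinese domestic phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University AI technology report series, natural language generation, the "NLU is too hard" series, automatic couplet data and bot, username blacklist, criminal charge and legal terminology with classification model, WeChat official account corpus, cs224n deep learning for NLP course, Chinese handwritten character recognition, Chinese NLP corpora/datasets, variable naming tool, word segmentation corpus + code, English task-oriented dialogue dataset, ASR speech dataset + deep-learning-based Chinese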
-
jieba
Jieba Chinese text segmentation
-
spaCy
💫 Industrial-strength Natural Language Processing (NLP) with Python and Cython
-
gensim
Topic Modelling for Humans
Latest mention: Koan: A word2vec negative sampling implementation with correct CBOW update | news.ycombinator.com | 2021-01-02
Apparently it did: https://github.com/RaRe-Technologies/gensim/issues/1873
-
nltk
NLTK Source
-
bert-as-service
Mapping a variable-length sentence to a fixed-length vector using BERT model
Latest mention: Needed 100% to pass a safety quiz, need to wait a week to retake | reddit.com/r/mildlyinfuriating | 2021-01-12
You joke but
-
pattern
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
-
textblob
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
-
datasets
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Latest mention: [P] 611 text datasets in 467 languages in the new v1.2 release of HuggingFace datasets library | reddit.com/r/MachineLearning | 2021-01-05
There will be 13 more by the end of this week, from Microsoft CodeXGlue. I did not have time to fix my PR earlier: https://github.com/huggingface/datasets/pull/997
-
ciphey
⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡
If you want a code cracked, in case of #2, there are many resources online and offline that can help you solve this. One of them is the excellent /r/codes subreddit, which may be able to point you in the right direction. Also interesting is Ciphey, which can identify and decrypt a lot of codes.
-
pytext
A natural language modeling framework based on PyTorch
-
snownlp
Python library for processing Chinese text
-
pkuseg-python
The pkuseg toolkit for multi-domain Chinese word segmentation
-
stanza
Official Stanford NLP Python Library for Many Human Languages
-
doccano
Open source text annotation tool for machine learning practitioners.
-
PyTorch-NLP
Basic Utilities for PyTorch Natural Language Processing (NLP)
-
polyglot
Multilingual text (NLP) processing toolkit
-
langid.py
Stand-alone language identification system
-
uda
Unsupervised Data Augmentation (UDA)
The words that replace the original words are chosen by calculating TF-IDF scores of words over the whole document and taking the lowest-scoring ones. You can refer to the code implementation for this in the original paper here.
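The selection step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the UDA repository's implementation: it uses a basic TF-IDF formula (term frequency times log of inverse document frequency, with no smoothing), and the helper names `tfidf_scores` and `lowest_tfidf_words` are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    all_scores = []
    for doc in docs:
        term_freq = Counter(doc)
        all_scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return all_scores


def lowest_tfidf_words(scores, k):
    """Pick the k lowest-scoring words; ties are broken alphabetically."""
    return sorted(scores, key=lambda word: (scores[word], word))[:k]


docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
scores = tfidf_scores(docs)
print(lowest_tfidf_words(scores[0], 1))  # ['the']
```

Here "the" appears in every document, so its inverse document frequency is log(1) = 0 and it is the first word selected. Words that are frequent everywhere carry little information, which is why low scores are used to pick candidates.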
-
textacy
NLP, before and after spaCy
-
quepy
A python framework to transform natural language questions to queries in a database query language.
-
TextAttack
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP
Libraries like nlpaug and textattack provide a simple and consistent API for applying the above NLP data augmentation methods in Python. They are framework-agnostic and can be easily integrated into your pipeline.
Index
What are some of the best open-source Natural Language Processing projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | transformers | 39,664 |
2 | funNLP | 28,070 |
3 | jieba | 25,296 |
4 | spaCy | 18,095 |
5 | gensim | 11,612 |
6 | nltk | 9,563 |
7 | bert-as-service | 8,791 |
8 | pattern | 7,755 |
9 | textblob | 7,489 |
10 | datasets | 6,551 |
11 | ciphey | 6,160 |
12 | pytext | 6,118 |
13 | snownlp | 5,241 |
14 | pkuseg-python | 5,240 |
15 | stanza | 5,119 |
16 | doccano | 4,188 |
17 | PyTorch-NLP | 1,851 |
18 | polyglot | 1,750 |
19 | langid.py | 1,707 |
20 | uda | 1,594 |
21 | textacy | 1,592 |
22 | quepy | 1,197 |
23 | TextAttack | 1,192 |