Top 23 Natural Language Processing Open-Source Projects
-
transformers
🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.
Project mention: HuggingFace Bert Pytorch Implementation Question | reddit.com/r/learnmachinelearning | 2021-04-02
I'm walking through the BertModel code from HuggingFace (https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py) and it’s mostly straightforward except for the parts related to the “decoder” mode. I am confused about why there's a decoder mode for Bert.. From my understanding (may be wrong?) BERT is just an encoder part of the Transformer with MLM/NSP on top. So when would we need to use cross attention here?
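For readers unsure what "cross attention" refers to in the question above, here is a minimal sketch in plain Python. It assumes the usual scaled dot-product formulation, where queries come from one sequence (the decoder side) and keys and values come from another (the encoder output). The learned projection matrices, multiple heads, and masking found in a real model are deliberately left out, and the function names are illustrative only, not taken from the Transformers code base.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries and keys/values
    come from two different sequences.

    queries: vectors from the decoder side
    keys, values: vectors from the encoder side (same length as each other)
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every encoder position.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the encoder values.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Toy data (hypothetical): two decoder positions, three encoder positions.
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(decoder_states, encoder_states, encoder_states)
print(attended)
```

Each output row is a weighted average of the encoder vectors, which is why this kind of layer only makes sense when a second sequence is available, as in an encoder-decoder setup.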
-
funNLP
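Chinese and English sensitive words, language detection, Chinese and foreign mobile/landline number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name databases, Chinese abbreviation database, character decomposition dictionary, word sentiment values, stop words, reactionary word list, violence and terrorism word list, traditional/simplified Chinese conversion, English imitation of Chinese pronunciation, Wang Feng lyrics generator, occupation name lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, segmentation of continuous English text, various Chinese word vectors, company name list, classical poetry corpus, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automotive lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese question answering dataset, collection of sentence similarity matching algorithms, BERT resources, text generation and summarization tools, cocoNLP information extraction tool, regex matching for Chinese domestic phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University artificial intelligence technology report series, natural language generation, the "NLU is too hard" series, automatic couplet data and bot, username blacklist, criminal charge and legal terms with classification model, WeChat official account corpus, cs224n deep learning for natural language processing course, Chinese handwritten character recognition, Chinese natural language processing corpora/datasets, variable naming tool, word segmentation corpus + code, task-oriented dialogue English datasets, ASR speech datasets + deep-learning-based Chinese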
-
Project mention: Learn vocabulary effortlessly while browsing the web [FR,EN,DE,PT,ES] | reddit.com/r/languagelearning | 2021-03-23
Since you're saying the main issue is segmentation, there are libraries to help out with that issue. jieba is fantastic if you have a Python backend, nodejieba (50k downloads/week) if it's more JS-side.
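To make the segmentation problem mentioned above concrete, the following is a minimal sketch of dictionary-based word segmentation using greedy forward maximum matching. This is a deliberately simple stand-in and is not how jieba or nodejieba work internally; the `segment` function and the toy vocabulary are hypothetical and exist only for illustration.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    if none matches, emit a single character and move on.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
        else:
            # No multi-character dictionary word starts here.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy dictionary, for illustration only.
VOCAB = {"喜欢", "自然", "语言", "自然语言", "处理"}

sentence = "我喜欢自然语言处理"  # "I like natural language processing"
tokens = segment(sentence, VOCAB)
print(tokens)  # ['我', '喜欢', '自然语言', '处理']
```

Greedy matching fails on ambiguous inputs where the longest match is not the right one, which is one reason a dedicated library is the better choice in practice.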
-
This is a task that can be completed with NLP, and it may be easier than you think. For novices I would recommend spaCy to get started. It has a lot of built-in tools, including the ability to break sentences down into their entities to identify whether something is ongoing or in the past. Good luck!
-
NLP-progress
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
Project mention: How to do undergrad research the right way? | reddit.com/r/LanguageTechnology | 2021-04-14
NLP is a very broad topic and like you said it can be extremely overwhelming to keep up with all the recent advancements, especially if you are a beginner. I would suggest you to take a look at nlp_tasks or NLP-progress or The Big Bad NLP Database to get an idea of the different tasks in NLP and see if you can find anything that looks interesting to you.
-
Project mention: Superior tools to Gensim's similarity | reddit.com/r/LanguageTechnology | 2021-03-20
So Gensim's Similarity module seems like a good fit for this problem, especially soft cosine similarity checking. But inside I can't get comfortable, because transformers are very popular lately.
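For context on the soft cosine similarity mentioned above, here is a minimal sketch of the measure in plain Python. It assumes the standard definition, in which a term-to-term similarity matrix lets related but distinct words contribute to the score. It does not use Gensim's API; the function name, vocabulary, and similarity values are made up for illustration.

```python
import math


def soft_cosine(a, b, sim):
    """Soft cosine similarity between bag-of-words vectors a and b.

    sim[i][j] is the similarity between term i and term j
    (1.0 on the diagonal). With an identity matrix this reduces
    to ordinary cosine similarity.
    """
    def form(x, y):
        # Bilinear form: sum over all term pairs of sim[i][j] * x[i] * y[j].
        return sum(
            sim[i][j] * x[i] * y[j]
            for i in range(len(x))
            for j in range(len(y))
        )

    denom = math.sqrt(form(a, a)) * math.sqrt(form(b, b))
    return form(a, b) / denom if denom else 0.0


# Hypothetical vocabulary: ["car", "automobile", "banana"].
IDENTITY = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
# Assumed similarity values: "car" and "automobile" are close, "banana" is unrelated.
SIM = [[1.0, 0.8, 0.0],
       [0.8, 1.0, 0.0],
       [0.0, 0.0, 1.0]]

doc_car = [1.0, 0.0, 0.0]   # a document containing only "car"
doc_auto = [0.0, 1.0, 0.0]  # a document containing only "automobile"

plain = soft_cosine(doc_car, doc_auto, IDENTITY)
soft = soft_cosine(doc_car, doc_auto, SIM)
print(plain, soft)
```

The two toy documents share no words, so ordinary cosine similarity scores them as unrelated, while the soft variant credits the closeness of "car" and "automobile". In practice the similarity matrix is usually derived from word embeddings rather than written by hand.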
-
rasa
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants (by RasaHQ)
Project mention: Building a Social Engineering Chatbot for Cyber Security Awareness | reddit.com/r/artificial | 2021-04-05
There is a Python framework named Rasa; it's really easy and is open source. I use it at work. As for the frontend, you can use botfront ui. https://github.com/RasaHQ/rasa https://github.com/botfront/rasa-webchat
-
For NER, if you don't need the full toolkit of spacy, I'd highly recommend checking out Flair. It will likely run faster than transformer-based models (like en_core_web_trf) and it tends to be one of the best performing approaches to NER.
-
Project mention: C4 dataset released (800GB Common Crawl-derived text; T5 training data) | reddit.com/r/mlscaling | 2021-03-16
-
applied-ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Project mention: What content would be useful to intermediate Data Scientist | reddit.com/r/datascience | 2021-04-12
Check out this repo. They collect hundreds of case studies, broken down by dozens of methodologies from large real-world companies such as AirBnB, Nvidia, Uber, Netflix etc.
-
Project mention: Needed 100% to pass a safety quiz, need to wait a week to retake | reddit.com/r/mildlyinfuriating | 2021-01-12
You joke but
-
Pattern
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
-
TextBlob
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more. (by sloria)
-
datasets
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools
This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.
-
Ciphey
⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡ (by Ciphey)
-
Project mention: Using a Pi 4 for basic speech recognition in my new animatronic home assistant (GLaDOS from Portal 2) | reddit.com/r/raspberry_pi | 2021-04-12
Maybe the following would be of interest: reddit.com//r/Mycroftai https://mycroft.ai/
Index
What are some of the best open-source Natural Language Processing projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | transformers | 43,997 |
2 | funNLP | 29,991 |
3 | Jieba | 25,978 |
4 | spaCy | 20,186 |
5 | NLP-progress | 18,313 |
6 | gensim | 11,941 |
7 | rasa | 11,146 |
8 | flair | 10,224 |
9 | allennlp | 9,920 |
10 | NLTK | 9,803 |
11 | applied-ml | 9,611 |
12 | natural | 9,560 |
13 | bert-as-service | 9,125 |
14 | CoreNLP | 7,916 |
15 | Pattern | 7,858 |
16 | TextBlob | 7,615 |
17 | datasets | 7,211 |
18 | Ciphey | 6,669 |
19 | pytext | 6,157 |
20 | SnowNLP | 5,361 |
21 | pkuseg-python | 5,361 |
22 | Stanza | 5,355 |
23 | mycroft-core | 5,035 |