Top 23 Natural Language Processing Open-Source Projects
🤗 Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
Project mention: HuggingFace Bert Pytorch Implementation Question | reddit.com/r/learnmachinelearning | 2021-04-02
I'm walking through the BertModel code from HuggingFace (https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py) and it’s mostly straightforward except for the parts related to the “decoder” mode. I am confused about why there's a decoder mode for BERT. From my understanding (which may be wrong), BERT is just the encoder part of the Transformer with MLM/NSP on top. So when would we need to use cross-attention here?
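Chinese and English sensitive words, language detection, Chinese and foreign mobile/landline number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese personal name databases, Chinese abbreviation database, character decomposition dictionary, word sentiment values, stop words, reactionary word list, violence and terrorism word list, traditional/simplified Chinese conversion, English imitation of Chinese pronunciation, Wang Feng lyrics generator, occupation name lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, continuous English text splitting, various Chinese word vectors, company name collection, classical poetry database, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automotive lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese question-answering dataset, collection of sentence similarity matching algorithms, BERT resources, text generation and summarization tools, cocoNLP information extraction tool, regular expressions for matching Chinese domestic phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University artificial intelligence technology report series, natural language generation, "NLU is too hard" series, automatic couplet data and bot, username blacklist, criminal charge and legal terminology with classification models, WeChat official account corpus, cs224n deep learning for natural language processing course, Chinese handwritten character recognition, Chinese natural language processing corpora/datasets, variable naming tool, word segmentation corpus + code, task-oriented dialogue English datasets, ASR speech datasets + deep learning-based Chinese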
Jieba Chinese word segmentation
Project mention: Learn vocabulary effortlessly while browsing the web [FR,EN,DE,PT,ES] | reddit.com/r/languagelearning | 2021-03-23
Since you're saying the main issue is segmentation, there are libraries to help out with that issue. jieba is fantastic if you have a Python backend, nodejieba (50k downloads/week) if it's more JS-side.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Project mention: NLP Help | Scraping Question | reddit.com/r/LanguageTechnology | 2021-04-19
This is a task that can be completed with NLP and it may be easier than you think. For novices I would recommend spaCy to get started. They have a lot of built-in tools, including the ability to break sentences down into their entities to identify if something is ongoing or in the past. Good luck!
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
Project mention: How to do undergrad research the right way? | reddit.com/r/LanguageTechnology | 2021-04-14
NLP is a very broad topic and, like you said, it can be extremely overwhelming to keep up with all the recent advancements, especially if you are a beginner. I would suggest taking a look at nlp_tasks or NLP-progress or The Big Bad NLP Database to get an idea of the different tasks in NLP and see if you can find anything that looks interesting to you.
Topic Modelling for Humans
Project mention: Superior tools to Gensim's similarity | reddit.com/r/LanguageTechnology | 2021-03-20
So Gensim's Similarity module seems like a good fit for this problem, especially soft cosine similarity checking. But I can't get fully comfortable with it, because transformers are very popular lately.
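For readers unfamiliar with the soft cosine measure mentioned in the comment above, here is a minimal pure-Python sketch. It does not use Gensim's API; `soft_cosine` is a hypothetical helper name, and the three-term vocabulary and term-similarity values are made up for illustration.

```python
import math


def soft_cosine(a, b, sim):
    """Soft cosine similarity between bag-of-words vectors a and b.

    sim[i][j] is an assumed similarity between vocabulary terms i and j
    (1.0 on the diagonal). When sim is the identity matrix, this reduces
    to ordinary cosine similarity.
    """
    def bilinear(x, y):
        # Computes x^T * sim * y
        return sum(x[i] * sim[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


# Hypothetical vocabulary: ["play", "game", "weather"]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
related = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]  # made-up values

doc_a = [1.0, 0.0, 0.0]  # contains only "play"
doc_b = [0.0, 1.0, 0.0]  # contains only "game"

plain = soft_cosine(doc_a, doc_b, identity)  # no shared terms
soft = soft_cosine(doc_a, doc_b, related)    # related terms count partially
print(plain, soft)
```

With the identity matrix the two documents share no terms and score zero; with the made-up similarity matrix, the relatedness of "play" and "game" gives them a nonzero score. That is the property that makes the measure attractive when documents use different but related words.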
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants (by RasaHQ)
Project mention: Building a Social Engineering Chatbot for Cyber Security Awareness | reddit.com/r/artificial | 2021-04-05
There is a Python framework named Rasa. It's really easy to use and is open source. I use it at work. As for the frontend, you can use the Botfront UI. https://github.com/RasaHQ/rasa https://github.com/botfront/rasa-webchat
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Project mention: SpaCy VS Transformers for NER | reddit.com/r/LanguageTechnology | 2021-03-11
For NER, if you don't need the full toolkit of spacy, I'd highly recommend checking out Flair. It will likely run faster than transformer-based models (like en_core_web_trf) and it tends to be one of the best performing approaches to NER.
An open-source NLP research library, built on PyTorch.
Project mention: C4 dataset released (800GB Common Crawl-derived text; T5 training data) | reddit.com/r/mlscaling | 2021-03-16
NLTK Source
Project mention: Wordnet and Sexism | reddit.com/r/datascience | 2021-01-03
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Project mention: What content would be useful to intermediate Data Scientist | reddit.com/r/datascience | 2021-04-12
Check out this repo. They collect hundreds of case studies, broken down by dozens of methodologies from large real-world companies such as AirBnB, Nvidia, Uber, Netflix etc.
general natural language facilities for node
Mapping a variable-length sentence to a fixed-length vector using BERT model
Project mention: Needed 100% to pass a safety quiz, need to wait a week to retake | reddit.com/r/mildlyinfuriating | 2021-01-12
You joke but
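The idea in this project's description, turning sentences of different lengths into vectors of one fixed length, can be illustrated with mean pooling over per-token vectors. This is only a sketch of one common pooling strategy. The listing does not say which strategy the project uses, `mean_pool` is a hypothetical helper, and the 4-dimensional token vectors are made up to stand in for real encoder outputs.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-length sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


# Made-up token vectors: a 2-token sentence and a 3-token sentence.
short_sentence = [[1.0, 0.0, 2.0, 4.0],
                  [3.0, 2.0, 0.0, 0.0]]
long_sentence = [[0.0, 3.0, 3.0, 0.0],
                 [3.0, 0.0, 0.0, 3.0],
                 [0.0, 0.0, 3.0, 3.0]]

short_vec = mean_pool(short_sentence)
long_vec = mean_pool(long_sentence)

# Both results have the same length regardless of sentence length.
print(len(short_vec), len(long_vec))
```

Because the output length depends only on the vector dimension, sentences of any length can be compared or indexed in the same vector space.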
Stanford CoreNLP: A Java suite of core NLP tools.
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more. (by sloria)
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Project mention: Build an Embeddings index with Hugging Face Datasets | dev.to | 2021-01-28
This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.
⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡ (by Ciphey)
Project mention: So, You Want to Learn to Break Ciphers | news.ycombinator.com | 2021-02-13
A natural language modeling framework based on PyTorch
Python library for processing Chinese text
The pkuseg toolkit for multi-domain Chinese word segmentation
Official Stanford NLP Python Library for Many Human Languages
Mycroft Core, the Mycroft Artificial Intelligence platform.
Project mention: Using a Pi 4 for basic speech recognition in my new animatronic home assistant (GLaDOS from Portal 2) | reddit.com/r/raspberry_pi | 2021-04-12
Maybe the following would be of interest: reddit.com/r/Mycroftai https://mycroft.ai/
What are some of the best open-source Natural Language Processing projects? This list will help you: