Python Natural Language Processing

Open-source Python projects categorized as Natural Language Processing

Top 23 Python Natural Language Processing Projects

  • GitHub repo transformers

    🤗Transformers: State-of-the-art Natural Language Processing for PyTorch, TensorFlow, and JAX.

    Project mention: [eclectus] - a free tool for stock research, I used nlp to summarize important sec filings | reddit.com/r/SideProject | 2021-06-13

    If I were to do it again, I would use Pegasus as implemented in Hugging Face's transformers: https://huggingface.co/transformers/ https://huggingface.co/transformers/model_doc/pegasus.html

  • GitHub repo funNLP

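    Chinese and English sensitive words, language detection, domestic and international mobile/phone number location and carrier lookup, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese name databases, Chinese abbreviation library, character decomposition dictionary, word sentiment values, stop words, reactionary word list, violence and terrorism word list, traditional/simplified Chinese conversion, English simulation of Chinese pronunciation, Wang Feng lyrics generator, occupation name lexicon, synonym lexicon, antonym lexicon, negation word lexicon, car brand lexicon, car parts lexicon, segmentation of continuous English text, various Chinese word vectors, company name collection, classical poetry corpus, IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical figures lexicon, poetry lexicon, medical lexicon, food and drink lexicon, legal lexicon, automobile lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese Q&A dataset, collection of sentence similarity matching algorithms, BERT resources, text generation and summarization tools, cocoNLP information extraction tool, regex matching for Chinese domestic phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University AI technology report series, natural language generation, "NLU is too hard" series, automatic couplet data and bot, username blacklist, criminal charge and legal terminology with classification model, WeChat official account corpus, cs224n deep learning NLP course, Chinese handwritten character recognition, Chinese NLP corpora/datasets, variable naming tool, word segmentation corpus + code, task-oriented dialogue English dataset, ASR speech dataset + deep-learning-based Chinese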

  • GitHub repo bert

    TensorFlow code and pre-trained models for BERT

    Project mention: [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | reddit.com/r/Regressions | 2021-06-02

  • GitHub repo Jieba

    Jieba Chinese word segmentation

    Project mention: Learn vocabulary effortlessly while browsing the web [FR,EN,DE,PT,ES] | reddit.com/r/languagelearning | 2021-03-23

    Since you're saying the main issue is segmentation, there are libraries to help out with that issue. jieba is fantastic if you have a Python backend, nodejieba (50k downloads/week) if it's more JS-side.
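
To make the segmentation problem in the Jieba mention above concrete, here is a minimal sketch of dictionary-based forward maximum matching. It is an illustration only and is not jieba's own algorithm; the function name `forward_max_match`, the `max_word_len` parameter, and the toy vocabulary are all assumptions made for this example.

```python
def forward_max_match(text, vocabulary, max_word_len=4):
    """Greedy segmentation: at each position take the longest known word.

    Hypothetical helper for illustration; unknown characters are emitted
    as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy vocabulary (an assumption, not a real lexicon).
vocabulary = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
tokens = forward_max_match("我喜欢自然语言处理", vocabulary)
```

Greedy matching like this is easy to follow but can pick the wrong split when words overlap, which is one reason dedicated libraries are usually preferred.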

  • GitHub repo spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Resume Advice Thread - June 08, 2021 | reddit.com/r/cscareerquestions | 2021-06-08

    "metadata" is "meta-data", "Spacy" is formally "spaCy", "Node" is formally "Node.js", "Mongo" is formally "MongoDB", "Websockets" is (possibly) "WebSocket", "twitter" is formally "Twitter", and "Javascript" is formally "JavaScript".

  • GitHub repo NLP-progress

    Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

    Project mention: What are state-of-the-art methods for abstractive text summarization ? | reddit.com/r/LanguageTechnology | 2021-06-03

  • GitHub repo gensim

    Topic Modelling for Humans

    Project mention: The Levenshtein Distance in Production | news.ycombinator.com | 2021-06-06

    > Problem statement: the Levenshtein distance is a string metric for measuring the difference between two sequences

    Another variant is "I have a bunch of words (a dictionary) and one query word, and want to find all words from the dictionary that are close to the query word".

    This leads to an interesting class of problems, because you can do clever things where you precompute search structures (Levenshtein automata [0]) from the dictionary. The similarity queries then run (much) faster – in production, performance matters.

    We recently merged a PR like that into Gensim [1].

    This gave a ~1,500x speed-up compared to naively comparing all pairwise strings with Levenshtein distance. A difference between the training step running for years (=unusable) and minutes.

    [0] http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levensht...

    [1] https://github.com/RaRe-Technologies/gensim/pull/3146
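
The "dictionary plus one query word" variant described in the gensim mention above can be sketched with the naive baseline that the comment compares against: compute the Levenshtein distance from the query to every dictionary word. This is not the Levenshtein-automaton approach from the linked pull request, and the names `levenshtein` and `close_words` are hypothetical, chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def close_words(query: str, dictionary: list[str], max_distance: int) -> list[str]:
    """Naive baseline: compare the query against every dictionary word."""
    return [word for word in dictionary if levenshtein(query, word) <= max_distance]


matches = close_words("cat", ["cat", "cart", "dog", "bat"], max_distance=1)
```

Each query here costs one full distance computation per dictionary word, which is the cost that a precomputed search structure is meant to avoid.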

  • GitHub repo rasa

    💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants (by RasaHQ)

    Project mention: Building a Social Engineering Chatbot for Cyber Security Awareness | reddit.com/r/artificial | 2021-04-05

    There is a Python framework named Rasa; it's really easy to use and is open source. I use it at work. As for the frontend, you can use the Botfront UI. https://github.com/RasaHQ/rasa https://github.com/botfront/rasa-webchat

  • GitHub repo flair

    A very simple framework for state-of-the-art Natural Language Processing (NLP)

    Project mention: Advice for how to approach classifying apartment posts on facebook? | reddit.com/r/LanguageTechnology | 2021-06-04

    For example, my first approach to the pet sentences would be to label all sentences within a respective text corpus that contain the relevant information as either yes or no. You would then convert this to a ternary tag set, something like ["pet allowed", "pet not allowed", "irrelevant"]. You could then try out a model based on SentenceBert, other sentence-level embeddings/language models, or 1D CNNs for this. flairNLP (https://github.com/flairNLP/flair) is a small framework that provides comfortable high-level access to different common language models and integrates well with PyTorch.

  • GitHub repo allennlp

    An open-source NLP research library, built on PyTorch.

    Project mention: C4 dataset released (800GB Common Crawl-derived text; T5 training data) | reddit.com/r/mlscaling | 2021-03-16

  • GitHub repo d2l-en

    Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 175 universities.

    Project mention: I created a way to learn machine learning through Jupyter | reddit.com/r/learnmachinelearning | 2021-04-30

    There are actually some online books and courses built on Jupyter Notebook ([Dive into Deep Learning](https://github.com/d2l-ai/d2l-en), for example). However, yours is more detailed and could really help beginners.

  • GitHub repo NLTK

    NLTK Source

    Project mention: Do programmers save chunks of code for repeated use? | reddit.com/r/learnpython | 2021-04-27

    Around 782 - https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/framenet.py

  • GitHub repo bert-as-service

    Mapping a variable-length sentence to a fixed-length vector using BERT model

    Project mention: Needed 100% to pass a safety quiz, need to wait a week to retake | reddit.com/r/mildlyinfuriating | 2021-01-12

    You joke but

  • GitHub repo datasets

    🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

    Project mention: Build an Embeddings index with Hugging Face Datasets | dev.to | 2021-01-28

    This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.

  • GitHub repo Pattern

    Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

  • GitHub repo TextBlob

    Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more. (by sloria)

  • GitHub repo Ciphey

    ⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡

    Project mention: I wrote a tool that solves lame CTF Challenges by finding CTF Flags, IP Addresses, and more in pcap files, binaries or any text file | reddit.com/r/securityCTF | 2021-05-28

  • GitHub repo pytext

    A natural language modeling framework based on PyTorch

  • GitHub repo Stanza

    Official Stanford NLP Python Library for Many Human Languages

  • GitHub repo pkuseg-python

    The pkuseg toolkit for multi-domain Chinese word segmentation

  • GitHub repo attention-is-all-you-need-pytorch

    A PyTorch implementation of the Transformer model in "Attention is All You Need".

    Project mention: Lack of activation in transformer feedforward layer? | reddit.com/r/learnmachinelearning | 2021-05-20

    I'm curious as to why the second matrix multiplication is not followed by an activation unlike the first one. Is there any particular reason why a non-linearity would be trivial or even avoided in the second operation? For reference, variations of this can be witnessed in a number of different implementations, including BERT-pytorch and attention-is-all-you-need-pytorch.
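
The structure the question above refers to is the position-wise feed-forward layer from "Attention is All You Need", FFN(x) = max(0, xW1 + b1)W2 + b2. The sketch below shows only that structure in plain Python: the first projection is followed by a ReLU and the second is left linear. The helper names and the toy weights are assumptions for illustration and are not taken from either implementation mentioned in the post.

```python
def matmul(x, w):
    """Multiply matrix x (n x k) by matrix w (k x m), both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def add_bias(x, b):
    """Add bias vector b to every row of x."""
    return [[value + bias for value, bias in zip(row, b)] for row in x]


def relu(x):
    """Element-wise max(0, value)."""
    return [[max(0.0, value) for value in row] for row in x]


def feed_forward(x, w1, b1, w2, b2):
    hidden = relu(add_bias(matmul(x, w1), b1))  # first projection + non-linearity
    return add_bias(matmul(hidden, w2), b2)     # second projection, no activation


# Toy example: one position, model width 2, hidden width 3 (assumed values).
x = [[1.0, -2.0]]
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]
output = feed_forward(x, w1, b1, w2, b2)
```

Because the second projection is not clamped, the output can contain negative values, as it does for these toy inputs.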

  • GitHub repo SnowNLP

    Python library for processing Chinese text

  • GitHub repo mycroft-core

    Mycroft Core, the Mycroft Artificial Intelligence platform.

    Project mention: Amazon plans to share your internet with your neighbors. This is how you opt out | reddit.com/r/technology | 2021-06-02

    It's going to be a hard sell to my wife to get rid of the Echo, so it looks like it's time to figure out setting up Mycroft

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-06-13.

Index

What are some of the best open-source Natural Language Processing projects in Python? This list will help you:

Rank Project Stars
1 transformers 46,980
2 funNLP 31,527
3 bert 28,246
4 Jieba 26,455
5 spaCy 20,639
6 NLP-progress 18,612
7 gensim 12,156
8 rasa 11,527
9 flair 10,448
10 allennlp 10,093
11 d2l-en 10,071
12 NLTK 9,931
13 bert-as-service 9,315
14 datasets 8,352
15 Pattern 7,962
16 TextBlob 7,700
17 Ciphey 7,016
18 pytext 6,201
19 Stanza 5,469
20 pkuseg-python 5,456
21 attention-is-all-you-need-pytorch 5,431
22 SnowNLP 5,361
23 mycroft-core 5,148