Python NLP

Open-source Python projects categorized as NLP

Top 23 Python NLP Projects

  • GitHub repo transformers

    🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch, TensorFlow, and JAX.

    Project mention: [eclectus] - a free tool for stock research, I used nlp to summarize important sec filings | reddit.com/r/SideProject | 2021-06-13

    If I were to do it again, I would use Pegasus as implemented in Hugging Face's transformers: https://huggingface.co/transformers/ https://huggingface.co/transformers/model_doc/pegasus.html

  • GitHub repo bert

    TensorFlow code and pre-trained models for BERT

    Project mention: [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | reddit.com/r/Regressions | 2021-06-02

  • GitHub repo spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Resume Advice Thread - June 08, 2021 | reddit.com/r/cscareerquestions | 2021-06-08

    "metadata" is "meta-data", "Spacy" is formally "spaCy", "Node" is formally "Node.js", "Mongo" is formally "MongoDB", "Websockets" is (possibly) "WebSocket", "twitter" is formally "Twitter", and "Javascript" is formally "JavaScript".

  • GitHub repo gensim

    Topic Modelling for Humans

    Project mention: The Levenshtein Distance in Production | news.ycombinator.com | 2021-06-06

    > Problem statement: the Levenshtein distance is a string metric for measuring the difference between two sequences

    Another variant is "I have a bunch of words (a dictionary) and one query word, and want to find all words from the dictionary that are close to the query word".

    This leads to an interesting class of problems, because you can do clever things where you precompute search structures (Levenshtein automata [0]) from the dictionary. The similarity queries then run (much) faster – in production, performance matters.

    We recently merged a PR like that into Gensim [1].

    This gave a ~1,500x speed-up compared to naively comparing all pairwise strings with Levenshtein distance: the difference between a training step that runs for years (unusable) and one that runs for minutes.

    [0] http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levensht...

    [1] https://github.com/RaRe-Technologies/gensim/pull/3146
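
    The naive baseline mentioned in the comment above (comparing the query word against every dictionary word) can be sketched as follows. This is only an illustration of the slow approach, not the precomputed index merged into Gensim; the function names `levenshtein` and `closest_words` and the sample words are assumptions made for this example.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def closest_words(query: str, dictionary: List[str], max_distance: int) -> List[str]:
    """Naive scan: one full distance computation per dictionary word."""
    return [word for word in dictionary if levenshtein(query, word) <= max_distance]


# Hypothetical dictionary, for illustration only.
matches = closest_words("cat", ["cat", "cart", "dog", "bat"], max_distance=1)
```

    Every query costs one distance computation per dictionary word, which is why precomputing a search structure from the dictionary pays off when there are many queries.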

  • GitHub repo rasa

    💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants (by RasaHQ)

    Project mention: Building a Social Engineering Chatbot for Cyber Security Awareness | reddit.com/r/artificial | 2021-04-05

    There is a Python framework named Rasa; it's really easy to use and is open source. I use it at work. As for the frontend, you can use the Botfront UI. https://github.com/RasaHQ/rasa https://github.com/botfront/rasa-webchat

  • GitHub repo flair

    A very simple framework for state-of-the-art Natural Language Processing (NLP)

    Project mention: Advice for how to approach classifying apartment posts on facebook? | reddit.com/r/LanguageTechnology | 2021-06-04

    For example, my first approach to the pet sentences would be to label every sentence in the corpus that contains the relevant information as either yes or no. You would then convert this to a ternary tag set, something like ["pet allowed", "pet not allowed", "irrelevant"]. You could then try out a model based on Sentence-BERT, other sentence-level embeddings or language models, or 1D CNNs for this. flairNLP (https://github.com/flairNLP/flair) is a small framework that provides convenient high-level access to several common language models and integrates well with PyTorch.
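
    A rough sketch of the three-label setup described above. To stay self-contained it swaps the suggested sentence-embedding model for a toy bag-of-words vector and a nearest-centroid rule, so it shows only the shape of the labeled data and the prediction step. It does not use flair, and the training sentences and function names are invented for this example.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple

LABELS = ["pet allowed", "pet not allowed", "irrelevant"]


def embed(sentence: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(examples: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Sum the vectors of all sentences sharing a label into one centroid."""
    centroids = {label: Counter() for label in LABELS}
    for sentence, label in examples:
        centroids[label].update(embed(sentence))
    return centroids


def predict(centroids: Dict[str, Counter], sentence: str) -> str:
    """Return the label whose centroid is most similar to the sentence."""
    vector = embed(sentence)
    return max(LABELS, key=lambda label: cosine(vector, centroids[label]))


# Hypothetical hand-labeled sentences, for illustration only.
examples = [
    ("pets are welcome", "pet allowed"),
    ("cats and dogs allowed", "pet allowed"),
    ("no pets allowed", "pet not allowed"),
    ("sorry no dogs or cats", "pet not allowed"),
    ("rent is due monthly", "irrelevant"),
    ("two bedrooms near the station", "irrelevant"),
]
centroids = train(examples)
label = predict(centroids, "no cats")
```

    With a real model, `embed` would be replaced by the model's sentence vectors and the centroid rule by a trained classifier; the labeled (sentence, tag) pairs stay the same.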

  • GitHub repo allennlp

    An open-source NLP research library, built on PyTorch.

    Project mention: C4 dataset released (800GB Common Crawl-derived text; T5 training data) | reddit.com/r/mlscaling | 2021-03-16

  • GitHub repo NLTK

    NLTK Source

    Project mention: Do programmers save chunks of code for repeated use? | reddit.com/r/learnpython | 2021-04-27

    Around 782 - https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/framenet.py

  • GitHub repo bert-as-service

    Mapping a variable-length sentence to a fixed-length vector using BERT model

    Project mention: Needed 100% to pass a safety quiz, need to wait a week to retake | reddit.com/r/mildlyinfuriating | 2021-01-12

    You joke but

  • GitHub repo datasets

    🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

    Project mention: Build an Embeddings index with Hugging Face Datasets | dev.to | 2021-01-28

    This article shows how txtai can index and search with Hugging Face's Datasets library. Datasets opens access to a large and growing list of publicly available datasets. Datasets has functionality to select, transform and filter data stored in each dataset.

  • GitHub repo TextBlob

    Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more. (by sloria)

  • GitHub repo PaddleHub

    Awesome pre-trained models toolkit based on PaddlePaddle. (300+ models including Image, Text, Audio and Video, with Easy Inference & Serving deployment)

    Project mention: [P] PaddleHub: An awesome and easy-to-use pre-trained models toolkit | reddit.com/r/MachineLearning | 2021-06-10

    Code: https://github.com/PaddlePaddle/PaddleHub

  • GitHub repo Stanza

    Official Stanford NLP Python Library for Many Human Languages

  • GitHub repo attention-is-all-you-need-pytorch

    A PyTorch implementation of the Transformer model in "Attention is All You Need".

    Project mention: Lack of activation in transformer feedforward layer? | reddit.com/r/learnmachinelearning | 2021-05-20

    I'm curious as to why the second matrix multiplication is not followed by an activation unlike the first one. Is there any particular reason why a non-linearity would be trivial or even avoided in the second operation? For reference, variations of this can be witnessed in a number of different implementations, including BERT-pytorch and attention-is-all-you-need-pytorch.
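
    For reference, the layer in question is defined in "Attention Is All You Need" as FFN(x) = max(0, xW1 + b1)W2 + b2: two linear maps with a ReLU only between them. A minimal pure-Python sketch of that structure follows. The helper names and the toy weights are assumptions made for this example and are not taken from either repository.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute x @ weight + bias, with weight given as rows of shape (in, out)."""
    return [
        sum(x_i * row[j] for x_i, row in zip(x, weight)) + bias[j]
        for j in range(len(bias))
    ]


def relu(x: Vector) -> Vector:
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def feed_forward(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    hidden = relu(linear(x, w1, b1))  # first projection, followed by the activation
    return linear(hidden, w2, b2)     # second projection, no activation afterwards


# Toy weights: model width 2, hidden width 3 (hypothetical values).
x = [1.0, -2.0]
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[2.0, -1.0], [1.0, 1.0], [0.0, 3.0]]
b2 = [0.5, 0.0]

output = feed_forward(x, w1, b1, w2, b2)  # [2.5, -1.0]
```

    Because nothing follows the second projection, the output can take negative values (here -1.0), whereas a trailing ReLU would clip them to zero.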

  • GitHub repo best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

    Project mention: Are there any speech recognition modules so I can write one and do not have to rely on google and the likes? | reddit.com/r/learnmachinelearning | 2021-04-18

  • GitHub repo mycroft-core

    Mycroft Core, the Mycroft Artificial Intelligence platform.

    Project mention: Amazon plans to share your internet with your neighbors. This is how you opt out | reddit.com/r/technology | 2021-06-02

    It's going to be a hard sell to my wife to get rid of the Echo, so it looks like it's time to figure out setting up Mycroft

  • GitHub repo flashtext

    Extract Keywords from sentence or Replace keywords in sentences. (by vi3k6i5)

    Project mention: Quickest way to check that 14000 strings arent in An original string. | reddit.com/r/learnpython | 2021-04-15

  • GitHub repo BERT-pytorch

    Google AI 2018 BERT PyTorch implementation

    Project mention: Lack of activation in transformer feedforward layer? | reddit.com/r/learnmachinelearning | 2021-05-20

    I'm curious as to why the second matrix multiplication is not followed by an activation unlike the first one. Is there any particular reason why a non-linearity would be trivial or even avoided in the second operation? For reference, variations of this can be witnessed in a number of different implementations, including BERT-pytorch and attention-is-all-you-need-pytorch.

  • GitHub repo GPT2-Chinese

    Chinese version of GPT2 training code, using BERT tokenizer.

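    Project mention: Could mainland China gradually require all residents and businesses to periodically study Xi's speeches and news commentaries, and to submit written summaries of their thoughts? | reddit.com/r/China_irl | 2021-02-15

  • GitHub repo jina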

    An easier way to build neural search on the cloud

    Project mention: My open-source project is on Github trending #1 spot. I'm elated :), AMA | reddit.com/r/github | 2021-06-15

    It's been almost 1.5 years since we launched this open-source project, Jina, a neural search framework. And today, we ended up in the #1 spot on GitHub Trending.

  • GitHub repo bertviz

    Tool for visualizing attention in the Transformer model (BERT, GPT-2, Albert, XLNet, RoBERTa, CTRL, etc.)

    Project mention: At which linguistic patterns and features attention heads of BERT look to ? | reddit.com/r/LanguageTechnology | 2021-04-13

    As indirectly mentioned before, you can visualize the attention in your model with the bertviz package: https://github.com/jessevig/bertviz

  • GitHub repo text

    Data loaders and abstractions for text and NLP (by pytorch)

    Project mention: Tutorials/walkthroughs of torchtext 0.9 anywhere? | reddit.com/r/pytorch | 2021-06-04

    You can find the migration tutorial here https://github.com/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb

  • GitHub repo sumy

    Module for automatic summarization of text documents and HTML pages.

    Project mention: Sumy – module for automatic summarization of text documents and HTML pages | news.ycombinator.com | 2021-05-29

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-06-15.

Index

What are some of the best open-source NLP projects in Python? This list will help you:

Rank Project Stars
1 transformers 46,980
2 bert 28,246
3 spaCy 20,639
4 gensim 12,156
5 rasa 11,527
6 flair 10,448
7 allennlp 10,093
8 NLTK 9,931
9 bert-as-service 9,315
10 datasets 8,352
11 TextBlob 7,700
12 PaddleHub 6,200
13 Stanza 5,469
14 attention-is-all-you-need-pytorch 5,431
15 best-of-ml-python 5,300
16 mycroft-core 5,148
17 flashtext 4,783
18 BERT-pytorch 4,283
19 GPT2-Chinese 4,010
20 jina 3,919
21 bertviz 2,914
22 text 2,786
23 sumy 2,591