Python NLP

Open-source Python projects categorized as NLP

Top 23 Python NLP Projects

  • transformers

    🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

    Project mention: A look at Apple’s new Transformer-powered predictive text model | news.ycombinator.com | 2023-09-16

    https://github.com/huggingface/transformers/blob/0a55d9f7376...

    To summarize how they work: you keep some number of previously generated tokens, and once you get logits that you want to sample a new token from, you find the logits for existing tokens and multiply them by a penalty, thus lowering the probability of the corresponding tokens.
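
The repetition penalty summarized above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's code: the function names, the example logits, and the penalty value are all assumptions made for the example. One detail the summary glosses over is that plain multiplication by a penalty greater than 1 only lowers a logit that is negative, so the sketch uses the sign-aware convention (divide positive logits, multiply negative ones) so that every penalized token really does become less likely.

```python
import math


def apply_repetition_penalty(logits, previous_token_ids, penalty=1.2):
    """Return a copy of `logits` with previously generated tokens made less likely.

    Hypothetical helper for illustration; `penalty` is assumed to be > 1.
    """
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Positive logits are divided and negative logits are multiplied,
        # so the score moves downward in both cases.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax(logits):
    """Convert logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of four tokens; tokens 0 and 2 were already generated.
logits = [2.0, 1.0, -1.0, 0.5]
previous = [0, 2]

penalized = apply_repetition_penalty(logits, previous, penalty=2.0)
probs_before = softmax(logits)
probs_after = softmax(penalized)
```

Comparing `probs_before` with `probs_after` shows the probability of tokens 0 and 2 dropping while the untouched tokens 1 and 3 gain the difference.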

  • bert

    TensorFlow code and pre-trained models for BERT

    Project mention: Ernie, China's ChatGPT, Cracks Under Pressure | news.ycombinator.com | 2023-09-07

  • HanLP

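    Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing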

  • spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Retrieval Augmented Generation (RAG): How To Get AI Models Learn Your Data & Give You Answers | dev.to | 2023-09-18

  • datasets

    🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

    Project mention: How to Train Large Models on Many GPUs? | news.ycombinator.com | 2023-02-11

    https://github.com/huggingface/datasets

    https://github.com/huggingface/transformers

  • rasa

    💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

    Project mention: RasaGPT: First headless LLM chatbot built on top of Rasa, Langchain and FastAPI | news.ycombinator.com | 2023-05-08

    It is not itself a GPT. It is a framework-of-a-framework project built on top of Rasa (https://github.com/RasaHQ/rasa) and Langchain, which by default uses gpt3.5-turbo (change it in the .env file) or any foundation model you wish.

  • unilm

    Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

    Project mention: Microsoft Publishes LongNet: Scaling Transformers to 1,000,000,000 Tokens | /r/ArtificialInteligence | 2023-07-08

    The repository is available here.

  • Chinese-LLaMA-Alpaca

    Chinese LLaMA & Alpaca large language models, plus local CPU/GPU training and deployment

    Project mention: Chinese-Alpaca-Plus-13B-GPTQ | /r/LocalLLaMA | 2023-05-30

    I'd like to share with you today the Chinese-Alpaca-Plus-13B-GPTQ model, a 4-bit GPTQ-format quantised version of Yiming Cui's Chinese-LLaMA-Alpaca 13B for GPU inference.

  • gensim

    Topic Modelling for Humans

    Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08

  • best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

    Project mention: Ask HN: How to get back into AI? | news.ycombinator.com | 2022-12-10

    For Python, here's a nice compilation: https://github.com/ml-tooling/best-of-ml-python/blob/main/RE...

  • flair

    A very simple framework for state-of-the-art Natural Language Processing (NLP)

  • NLTK

    NLTK Source

    Project mention: Best Portfolio Projects for Data Science | dev.to | 2023-09-19

    NLTK Documentation

  • PaddleHub

    Awesome pre-trained models toolkit based on PaddlePaddle. (400+ models including Image, Text, Audio, Video and Cross-Modal with Easy Inference & Serving)

    Project mention: Where are all the multi-modal models? | /r/singularity | 2023-02-10

    China: All of the ERNIE 260B cross-modal stuff.

  • haystack

    🔍 LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

    Project mention: Llama2 and Haystack on Colab | news.ycombinator.com | 2023-07-21

    I recently conducted some experiments with Llama2 and Haystack (https://github.com/deepset-ai/haystack), the NLP/LLM framework.

    The notebook can be helpful for those trying to load Llama2 on Colab.

    1) Installed Transformers from the main branch (and other libraries)

  • PaddleNLP

    👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.

    Project mention: Is ChatGPT actually open source or not? | /r/China_irl | 2023-03-25

  • TextBlob

    Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

  • NeMo

    NeMo: a toolkit for conversational AI

    Project mention: [P] Making a TTS voice, HK-47 from Kotor using Tortoise (Ideally WaveRNN) | /r/MachineLearning | 2023-07-06

    I haven't tested WaveRNN, but of the ones I know, the best open-source option is FastPitch. It's easy to use; here is the tutorial for voice cloning.

  • attention-is-all-you-need-pytorch

    A PyTorch implementation of the Transformer model in "Attention is All You Need".

    Project mention: Question: LLMs | /r/learnmachinelearning | 2023-07-06

    I did implement an "LLM" proof of concept from scratch in a course for my master's, pretty much doing a small implementation of a transformer from the Attention is All You Need paper (plus other resources). It was useless, but it was a great experience for understanding how it works. There are a few implementations like this out there, like this one: https://github.com/jadore801120/attention-is-all-you-need-pytorch (first Google result). I think it is a fun exercise (the amount of fun depends on how much of a masochist you are :) ).

  • GPT2-Chinese

    Chinese version of GPT2 training code, using BERT tokenizer.

  • Stanza

    Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages

    Project mention: Down and Out in the Magic Kingdom | news.ycombinator.com | 2023-07-23

  • mycroft-core

    Mycroft Core, the Mycroft Artificial Intelligence platform.

    Project mention: Ask HN: Is there any open source/open hardware Echo Dot alike? | news.ycombinator.com | 2023-08-11

  • ERNIE

    Official implementations for various pre-training models of ERNIE-family, covering topics of Language Understanding & Generation, Multimodal Understanding & Generation, and beyond.

    Project mention: [N] Baidu to Unveil Conversational AI ERNIE Bot on March 16 (Live) | /r/MachineLearning | 2023-03-14

    Found relevant code at https://github.com/PaddlePaddle/ERNIE + all code implementations here

  • BERT-pytorch

    Google AI 2018 BERT pytorch implementation

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-09-19.

Index

What are some of the best open-source NLP projects in Python? This list will help you:

#   Project                            Stars
1   transformers                       112,164
2   bert                               35,290
3   HanLP                              30,308
4   spaCy                              27,161
5   datasets                           17,155
6   rasa                               17,009
7   unilm                              15,488
8   Chinese-LLaMA-Alpaca               14,660
9   gensim                             14,636
10  best-of-ml-python                  14,459
11  flair                              13,089
12  NLTK                               12,352
13  PaddleHub                          12,151
14  haystack                           10,742
15  PaddleNLP                          10,203
16  TextBlob                           8,691
17  NeMo                               7,940
18  attention-is-all-you-need-pytorch  7,875
19  GPT2-Chinese                       7,155
20  Stanza                             6,781
21  mycroft-core                       6,330
22  ERNIE                              5,996
23  BERT-pytorch                       5,706