Top 23 Python NLP Projects
-
Project mention: A look at Apple’s new Transformer-powered predictive text model | news.ycombinator.com | 2023-09-16
https://github.com/huggingface/transformers/blob/0a55d9f7376...
To summarize how they work: you keep some number of previously generated tokens, and once you get logits that you want to sample a new token from, you find the logits for existing tokens and multiply them by a penalty, thus lowering the probability of the corresponding tokens.
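The penalty scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the linked file: the function name `apply_repetition_penalty`, the default penalty value, and the sign-aware rule (divide positive logits, multiply negative ones, so a penalty above 1 always lowers the score) are assumptions made for this example.

```python
import math


def apply_repetition_penalty(logits, previous_tokens, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens penalized.

    Hypothetical helper for illustration. `logits` is a list of floats
    indexed by token id; `previous_tokens` holds the ids generated so far.
    """
    penalized = list(logits)
    for token_id in set(previous_tokens):
        score = penalized[token_id]
        # Assumed sign-aware rule: with penalty > 1, dividing a positive
        # logit or multiplying a negative one both push the score down.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0, 0.5]
previous_tokens = [0, 2]  # tokens 0 and 2 were already generated
penalized = apply_repetition_penalty(logits, previous_tokens, penalty=2.0)

before = softmax(logits)
after = softmax(penalized)
```

Comparing `before` and `after` shows the probability of tokens 0 and 2 dropping while the untouched tokens gain the freed-up mass.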
-
-
HanLP
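Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing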
-
Project mention: Retrieval Augmented Generation (RAG): How To Get AI Models Learn Your Data & Give You Answers | dev.to | 2023-09-18
-
datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
-
rasa
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
Project mention: RasaGPT: First headless LLM chatbot built on top of Rasa, Langchain and FastAPI | news.ycombinator.com | 2023-05-08
It is not itself a GPT. It is a framework-of-a-framework project built on top of Rasa (https://github.com/RasaHQ/rasa) and Langchain, which by default uses gpt3.5-turbo (change it in the .env file) or any foundation model you wish.
-
Project mention: Microsoft Publishes LongNet: Scaling Transformers to 1,000,000,000 Tokens | /r/ArtificialInteligence | 2023-07-08
The repository is available here.
-
I'd like to share with you today the Chinese-Alpaca-Plus-13B-GPTQ model, which is a 4-bit GPTQ-quantised version of Yiming Cui's Chinese-LLaMA-Alpaca 13B for GPU inference.
-
-
For Python, here's a nice compilation: https://github.com/ml-tooling/best-of-ml-python/blob/main/RE...
-
-
NLTK Documentation
-
PaddleHub
Awesome pre-trained models toolkit based on PaddlePaddle. (400+ models including Image, Text, Audio, Video and Cross-Modal with Easy Inference & Serving)
China: All of the ERNIE 260B cross-modal stuff.
-
haystack
🔍 LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
I recently conducted some experiments with Llama2 and Haystack (https://github.com/deepset-ai/haystack), the NLP/LLM framework.
The notebook can be helpful for those trying to load Llama2 on Colab.
1) Installed Transformers from the main branch (and other libraries)
-
PaddleNLP
👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
-
TextBlob
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
-
Project mention: [P] Making a TTS voice, HK-47 from Kotor using Tortoise (Ideally WaveRNN) | /r/MachineLearning | 2023-07-06
I haven't tested WaveRNN, but of the open-source options I know, the best is FastPitch. It's also easy to use; here is the tutorial for voice cloning.
-
attention-is-all-you-need-pytorch
A PyTorch implementation of the Transformer model in "Attention is All You Need".
I implemented an "LLM" proof of concept from scratch in a course for my master's, essentially a small implementation of a transformer from the "Attention Is All You Need" paper (plus other resources). It was useless, but it was a great way to understand how it works. There are a few implementations like this out there, such as this one: https://github.com/jadore801120/attention-is-all-you-need-pytorch (first Google result). I think it is a fun exercise (the amount of fun depends on how much of a masochist you are :) ).
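For a sense of what such a from-scratch exercise involves, here is a minimal sketch of scaled dot-product attention, the core operation of the Transformer: softmax(QK^T / sqrt(d_k)) V. It is written in plain Python for readability rather than taken from the linked repository; the function names and the toy inputs are assumptions for illustration only.

```python
import math


def softmax(row):
    """Convert a list of scores to weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V on nested lists.

    Hypothetical helper: `queries`, `keys` and `values` are lists of
    equal-length float vectors; `keys` and `values` have the same count.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


# Toy inputs: each query matches one key more strongly than the other.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
attended = scaled_dot_product_attention(queries, keys, values)
```

Each output row is a blend of the value vectors, tilted toward the value whose key best matches the query. A full Transformer adds learned projections, multiple heads, masking, and feed-forward layers on top of this.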
-
-
Stanza
Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
-
Project mention: Ask HN: Is there any open source/open hardware Echo Dot alike? | news.ycombinator.com | 2023-08-11
-
ERNIE
Official implementations for various pre-training models of ERNIE-family, covering topics of Language Understanding & Generation, Multimodal Understanding & Generation, and beyond.
Project mention: [N] Baidu to Unveil Conversational AI ERNIE Bot on March 16 (Live) | /r/MachineLearning | 2023-03-14
Found relevant code at https://github.com/PaddlePaddle/ERNIE + all code implementations here
-
Python NLP related posts
- best way to serve llama V2 (llama.cpp VS triton VS HF text generation inference)
- A look at Apple’s new Transformer-powered predictive text model
- Deploying Llama2 with vLLM vs TGI. Need advice
- [P][R] Finetune LLMs via the Finetuning Hub
- Show HN: Leverage Falcon 7B blog post
- Show HN: New AI Dataset Based on LibGen and Sci-Hub
- Ernie, China's ChatGPT, Cracks Under Pressure
Index
What are some of the best open-source NLP projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | transformers | 112,164 |
2 | bert | 35,290 |
3 | HanLP | 30,308 |
4 | spaCy | 27,161 |
5 | datasets | 17,155 |
6 | rasa | 17,009 |
7 | unilm | 15,488 |
8 | Chinese-LLaMA-Alpaca | 14,660 |
9 | gensim | 14,636 |
10 | best-of-ml-python | 14,459 |
11 | flair | 13,089 |
12 | NLTK | 12,352 |
13 | PaddleHub | 12,151 |
14 | haystack | 10,742 |
15 | PaddleNLP | 10,203 |
16 | TextBlob | 8,691 |
17 | NeMo | 7,940 |
18 | attention-is-all-you-need-pytorch | 7,875 |
19 | GPT2-Chinese | 7,155 |
20 | Stanza | 6,781 |
21 | mycroft-core | 6,330 |
22 | ERNIE | 5,996 |
23 | BERT-pytorch | 5,706 |