Top 16 Python Retrieval Projects
-
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
-
NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
-
memorizing-transformers-pytorch
Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch
-
searchGPT
Grounded search engine (i.e. with source reference) based on LLM / ChatGPT / OpenAI API. It supports web search, file content search etc.
-
raptor
The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
-
ACT
Atmospheric data Community Toolkit - A python based toolkit for exploring and analyzing time series atmospheric datasets (by ARM-DOE)
-
retomaton
PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)
-
ragswift
🚀 Scale your RAG pipeline using Ragswift: A scalable centralized embeddings management platform
-
SHREC2023-ANIMAR
Source codes of team TikTorch (1st place solution) for track 2 and 3 of the SHREC2023 Challenge
-
FloridaPropertyData
A Python-based tool for retrieving and processing property data for specific counties in Florida using Parcel ID numbers. Simplifies data retrieval and offers customization options for real estate agents, investors, and government officials.
RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries.
- Content vectors can differ vastly from question vectors. To address this, you have to use hypothetical embeddings (generate artificial questions for each chunk and store their embeddings).
- Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical question embeddings, metadata).
- RAG will fail miserably on requests like "summarize the whole document".
- To my knowledge, OpenAI embeddings don't perform well. Use an embedding model that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
[1] https://github.com/underlines/awesome-marketing-datascience/...
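The "several embeddings per chunk" idea from the comment above can be sketched roughly as follows. This is a minimal illustration, not code from any of the listed projects: `embed` is a toy bag-of-words stand-in for a real embedding model, the hypothetical questions are hand-written here (in practice an LLM would typically generate them), and all function names and sample data are assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Store several vectors per chunk: one for the text, one per question."""
    index = []
    for chunk_id, chunk in enumerate(chunks):
        index.append((chunk_id, "text", embed(chunk["text"])))
        for question in chunk["questions"]:
            index.append((chunk_id, "question", embed(question)))
    return index


def search(index, query, top_k=1):
    """Score each chunk by its best-matching vector and return top chunk ids."""
    query_vec = embed(query)
    best = {}
    for chunk_id, _kind, vec in index:
        score = cosine(query_vec, vec)
        best[chunk_id] = max(best.get(chunk_id, 0.0), score)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [chunk_id for chunk_id, _score in ranked[:top_k]]


# Hypothetical sample data: text, artificial questions, and metadata per chunk.
chunks = [
    {"text": "Invoices are payable within 30 days of receipt.",
     "questions": ["When do I have to pay my bill?"],
     "meta": {"source": "terms.txt"}},
    {"text": "The API rate limit is 100 requests per minute.",
     "questions": ["How many calls can I make per minute?"],
     "meta": {"source": "api.txt"}},
]
index = build_index(chunks)
result = search(index, "When do I have to pay?")
text_only_score = cosine(embed("When do I have to pay?"), embed(chunks[0]["text"]))
```

In this toy setup the query shares no words with the text of the first chunk, so a text-only index would miss it, while the stored hypothetical question still matches. With a real embedding model the gap would likely be smaller, but the indexing pattern is the same.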
The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
Project mention: Show HN: R2R – Open-source framework for production-grade RAG | news.ycombinator.com | 2024-02-26
Project mention: FastLLM by Qdrant – lightweight LLM tailored For RAG | news.ycombinator.com | 2024-04-01
Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21
Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...
At one point I experimented a little with transformers that had access to external memory searchable via KNN lookups https://github.com/lucidrains/memorizing-transformers-pytorc... or via routed queries with https://github.com/glassroom/heinsen_routing . Both approaches seemed to work for me, but I had to put that work on hold for reasons outside my control.
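For readers unfamiliar with the KNN-lookup idea mentioned above, here is a rough sketch of retrieving from an external key/value memory and attending over the retrieved entries. It is an assumption-laden simplification: it uses exact nearest-neighbor search instead of an approximate index, it is not the API of the linked repository, and the names `knn_lookup` and `attend` as well as the sample memory are made up for illustration.

```python
import heapq
import math


def knn_lookup(memory, query, k=2):
    """Return the k (key, value) pairs whose keys are closest to the query.

    Exact search for clarity; a real system would likely use an
    approximate nearest-neighbor index instead.
    """
    return heapq.nsmallest(k, memory, key=lambda kv: math.dist(kv[0], query))


def attend(neighbors, query):
    """Softmax-weighted average of retrieved values (dot-product scores)."""
    scores = [sum(q * x for q, x in zip(query, key)) for key, _value in neighbors]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(neighbors[0][1])
    return [
        sum(w * value[i] for w, (_key, value) in zip(weights, neighbors)) / total
        for i in range(dim)
    ]


# Hypothetical external memory of (key, value) vector pairs.
memory = [
    ((1.0, 0.0), (10.0, 0.0)),
    ((0.9, 0.1), (8.0, 2.0)),
    ((0.0, 1.0), (0.0, 10.0)),
]
query = (1.0, 0.0)
neighbors = knn_lookup(memory, query, k=2)
output = attend(neighbors, query)
```

Only the two closest memory entries contribute to `output`; the distant third entry is never touched, which is the property that lets this kind of memory grow large without attending over everything.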
Project mention: RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation | news.ycombinator.com | 2024-04-30
Worth a comparison with RAPTOR, another tiered RAG system.
https://arxiv.org/abs/2401.18059
If you are interested, you can check out the documentation here: https://github.com/raphaelsty/cherche
Project mention: Show HN: Ragswift – Scalable embeddings platform powered by distributed compute | news.ycombinator.com | 2024-01-22
Python Retrieval related posts
Index
What are some of the best open-source Retrieval projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | mteb | 1,395 |
2 | beir | 1,388 |
3 | R2R | 1,202 |
4 | RETRO-pytorch | 827 |
5 | fastembed | 796 |
6 | NeumAI | 779 |
7 | memorizing-transformers-pytorch | 611 |
8 | searchGPT | 570 |
9 | raptor | 450 |
10 | cherche | 313 |
11 | ACT | 126 |
12 | icl-ceil | 81 |
13 | retomaton | 64 |
14 | ragswift | 33 |
15 | SHREC2023-ANIMAR | 6 |
16 | FloridaPropertyData | 1 |