Top 23 Retrieval Open-Source Projects
-
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
-
NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
-
memorizing-transformers-pytorch
Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch
-
searchGPT
Grounded search engine (i.e. with source reference) based on LLM / ChatGPT / OpenAI API. It supports web search, file content search etc.
-
raptor
The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
-
indexify
A scalable realtime and continuous indexing and structured extraction engine for Unstructured Data to build Generative AI Applications
-
ACT
Atmospheric data Community Toolkit - A python based toolkit for exploring and analyzing time series atmospheric datasets (by ARM-DOE)
-
MoTIS
[NAACL 2022] Mobile text-to-image search powered by multimodal semantic representation models (e.g., OpenAI's CLIP) (by DRSY)
-
retomaton
PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)
-
ragswift
🚀 Scale your RAG pipeline using Ragswift: A scalable centralized embeddings management platform
-
SHREC2023-ANIMAR
Source codes of team TikTorch (1st place solution) for track 2 and 3 of the SHREC2023 Challenge
RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries
- Content vectors can differ vastly from question vectors. To bridge this gap you can use hypothetical embeddings: generate artificial questions for each chunk and store their embeddings
- Instead of saving just one embedding per text chunk, you should store several items per chunk (the text chunk itself, hypothetical question embeddings, metadata)
- RAG will fail miserably on requests like "summarize the whole document"
- To my knowledge, OpenAI embeddings don't perform particularly well; use an embedding model that is optimized for question answering or information retrieval and supports multiple languages. Also look into Instructor embeddings, and check the MTEB benchmark for comparisons: https://github.com/embeddings-benchmark/mteb
1 https://github.com/underlines/awesome-marketing-datascience/...
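The hypothetical-question idea from the list above can be sketched in a few lines. This is an illustrative toy, not any particular project's implementation: the embedding function is a stand-in hash trick (a real pipeline would use a QA-tuned embedding model), and the questions are hand-written where a real system would generate them with an LLM.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 64) -> list[float]:
    # Stand-in for a real embedding model: hash character trigrams
    # into a fixed-size normalized vector.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

class HypotheticalIndex:
    """Stores several embeddings per chunk: the chunk text itself plus
    hypothetical questions (hand-written here; normally LLM-generated)."""

    def __init__(self):
        self.entries = []  # (embedding, chunk_id)
        self.chunks = {}

    def add(self, chunk_id: str, text: str, questions: list[str]):
        self.chunks[chunk_id] = text
        for variant in [text] + questions:
            self.entries.append((toy_embed(variant), chunk_id))

    def query(self, question: str) -> str:
        q = toy_embed(question)
        best = max(self.entries, key=lambda e: cosine(q, e[0]))
        return self.chunks[best[1]]

index = HypotheticalIndex()
index.add("c1", "The invoice total is due within 30 days of receipt.",
          ["When is the invoice due?", "What is the payment deadline?"])
index.add("c2", "Support tickets are answered within one business day.",
          ["How fast is support?", "When will my ticket be answered?"])

print(index.query("What is the payment deadline?"))
```

Because the question embedding matches a stored hypothetical question rather than the chunk text, the question/content vector mismatch described above is sidestepped.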
The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
Project mention: Show HN: R2R – Open-source framework for production-grade RAG | news.ycombinator.com | 2024-02-26
Project mention: FastLLM by Qdrant – lightweight LLM tailored For RAG | news.ycombinator.com | 2024-04-01
Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21
Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4: it asks GPT for the Python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...
At one point I experimented a little with transformers that had access to external memory searchable via KNN lookups https://github.com/lucidrains/memorizing-transformers-pytorc... or via routed queries with https://github.com/glassroom/heinsen_routing . Both approaches seemed to work for me, but I had to put that work on hold for reasons outside my control.
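The external-memory idea can be sketched as a plain k-NN lookup over stored key/value pairs. This is a toy with random data, not the paper's implementation (which retrieves over cached attention keys with approximate nearest neighbors); all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# External memory: keys index past states, values are what gets
# mixed back in after retrieval. Shapes: (num_memories, dim).
dim, num_memories = 16, 1000
memory_keys = rng.standard_normal((num_memories, dim))
memory_values = rng.standard_normal((num_memories, dim))

def knn_memory_lookup(query: np.ndarray, k: int = 4) -> np.ndarray:
    """Return the mean of the k memory values whose keys are nearest
    to the query (exact search; large memories would use ANN)."""
    dists = np.linalg.norm(memory_keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return memory_values[nearest].mean(axis=0)

query = rng.standard_normal(dim)
retrieved = knn_memory_lookup(query)
print(retrieved.shape)  # (16,)
```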
Project mention: RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation | news.ycombinator.com | 2024-04-30
Worth a comparison with RAPTOR, another tiered RAG system.
https://arxiv.org/abs/2401.18059
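RAPTOR's core move, recursively grouping chunks and summarizing each group into a new layer, can be sketched roughly as below. The `summarize` function here just concatenates and truncates; the actual system uses an LLM for summaries and soft clustering rather than fixed-size groups.

```python
def summarize(texts: list[str], max_len: int = 200) -> str:
    # Placeholder for an LLM-generated abstractive summary.
    return " ".join(texts)[:max_len]

def build_raptor_tree(chunks: list[str], group_size: int = 2) -> list[list[str]]:
    """Build layers bottom-up: each layer groups the previous layer's
    nodes and replaces each group with a summary node. Retrieval then
    searches across all layers, so both fine detail and document-level
    overviews (e.g. 'summarize the whole document') are reachable."""
    layers = [chunks]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        groups = [prev[i:i + group_size] for i in range(0, len(prev), group_size)]
        layers.append([summarize(g) for g in groups])
    return layers

chunks = ["Alpha details.", "Beta details.", "Gamma details.", "Delta details."]
tree = build_raptor_tree(chunks)
print(len(tree))  # 3 layers: 4 leaves -> 2 summaries -> 1 root
```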
If you are interested, you can check out the documentation here: https://github.com/raphaelsty/cherche
Around two weeks ago, someone opened an issue on OasysDB asking to integrate it into their platform, Indexify, an open-source platform that extracts and processes unstructured data from different sources for generative AI apps in real time.
Project mention: Embeddings are a good starting point for the AI curious app developer | news.ycombinator.com | 2024-04-17
Yes, I use fastembed-rs[1] in a project I'm working on and it runs flawlessly. You can store the embeddings in any boring database, but for fast vector math, a vector database is recommended (e.g. the pgvector postgres extension).
[1] https://github.com/Anush008/fastembed-rs
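The "boring database" route can be as simple as SQLite with embeddings serialized as blobs and brute-force cosine similarity in application code, which is fine for small collections; pgvector or a dedicated vector database takes over when that gets slow. Everything below is illustrative, with dummy vectors standing in for model output.

```python
import sqlite3
import struct
import math

def pack(vec):
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / ((na * nb) or 1.0)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")

# Embeddings would come from a model (e.g. via fastembed); dummies here.
rows = [("cats are great", [1.0, 0.0, 0.2]),
        ("rust is fast", [0.0, 1.0, 0.1])]
for text, vec in rows:
    db.execute("INSERT INTO docs (text, embedding) VALUES (?, ?)",
               (text, pack(vec)))

def search(query_vec, k=1):
    # Brute-force scan: score every row, return the top-k texts.
    scored = [(cosine(query_vec, unpack(emb)), text)
              for text, emb in db.execute("SELECT text, embedding FROM docs")]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

print(search([0.9, 0.1, 0.2]))  # ['cats are great']
```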
Project mention: Show HN: Axilla – Open-source TypeScript framework for LLM apps | news.ycombinator.com | 2023-08-07
Hi HN, we are Nick and Ben, creators of Axilla.
Axilla is an open source TypeScript framework to develop LLM applications.
It’s in the early stages but you can use it today: we’ve already published 2 modules and have more coming soon!
Ben and I met while working at Cruise on the ML platform for self-driving cars. We spent many years there and learned the hard way that shipping AI is not quite the same as shipping regular code. There are many parts of the ML lifecycle, e.g., mining, processing, and labeling data and training, evaluating, and deploying models. Although none of them are rocket science, most of the inefficiencies tend to come from integrating them together. At Cruise, we built an integrated framework that accelerated the speed of shipping models to the car by 80%.
With the explosion of generative AI, we are seeing software teams building applications and features with the same inefficiencies we experienced at Cruise.
This got us excited about building an opinionated, end-to-end platform. We started building in Python but quickly noticed that most of the teams we talked to weren't using Python, but instead building in TypeScript. This is because most teams are not training their own models, but rather using foundational ones served by third parties over HTTP, like OpenAI, Anthropic, or even open-source ones from Hugging Face.
Because of this, we've decided to build Axilla as a TypeScript-first library.
Our goal is to build a modular framework that can be adopted incrementally yet benefits from full integration. For example, the production responses coming from the LLM should be able to be sent — with all necessary metadata — to the eval module or the labeling tooling.
So far, we’ve shipped 2 modules, that are available to use today on npm:
* *axgen*: focused on RAG-type workflows. Useful if you want to ingest data, get the embeddings, store them in a vector store, and then do similarity-search retrieval. It's how you give LLMs memory or more context about private data sources.
* *axeval*: a lightweight evaluation library that feels like Jest (so, like unit tests). In our experience, evaluation should be really easy to set up, to encourage continuous quality monitoring and slowly build ground-truth datasets of edge cases that can be used for regression testing and fine-tuning.
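The "evaluation as unit tests" idea translates to any language. A minimal Python analogue might look like the following; the names and scorers here are made up for illustration, not axeval's actual (TypeScript) API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str
    scorer: Callable[[str, str], float]

def contains(expected: str, actual: str) -> float:
    # Crude scorer: did the expected answer appear in the output?
    return 1.0 if expected.lower() in actual.lower() else 0.0

def run_suite(cases: list[EvalCase], model: Callable[[str], str]) -> float:
    """Run every case through the model and average the scores,
    like a test suite that reports a pass rate instead of pass/fail."""
    scores = [case.scorer(case.expected, model(case.prompt)) for case in cases]
    return sum(scores) / len(scores)

# A stub "model"; in practice this would call an LLM.
def fake_model(prompt: str) -> str:
    return "The capital of France is Paris." if "France" in prompt else "I don't know."

cases = [
    EvalCase("What is the capital of France?", "Paris", contains),
    EvalCase("What is 2 + 2?", "4", contains),
]
print(run_suite(cases, fake_model))  # 0.5
```

Running such a suite on every change gives a regression signal long before a full ground-truth dataset exists.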
We are working on a serving module and a data processing one next and would love to hear what functionality you need us to prioritize!
We built an open-source demo UI for you to discover the framework more: https://github.com/axilla-io/demo-ui
And here's a video of Nicholas walking through the UI that gives an idea of what axgen can do: https://www.loom.com/share/458f9b6679b740f0a5c78a33fffee3dc
We’d love to hear your feedback on the framework, you can let us know here, create an issue on the GitHub repo or send me an email at [email protected]
And of course, contributions welcome!
Project mention: SeekStorm VS tantivy - a user suggested alternative | libhunt.com/r/SeekStorm | 2024-03-22
Project mention: Show HN: Ragswift – Scalable embeddings platform powered by distributed compute | news.ycombinator.com | 2024-01-22
Retrieval related posts
-
How I got my first Rust job by doing open-source
-
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
-
FastLLM by Qdrant – lightweight LLM tailored For RAG
-
Indexify - Scalable, realtime, continuous indexing engine for unstructured data to AI
-
What are Vector Embeddings?
-
[D] Any pre trained retrieval based language models available?
-
[D] Is there an open-source implementation of the Retrieval-Enhanced Transformer (RETRO)?
-
Index
What are some of the best open-source Retrieval projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Apache Lucene | 2,147 |
2 | mteb | 1,395 |
3 | beir | 1,388 |
4 | R2R | 1,202 |
5 | RETRO-pytorch | 827 |
6 | fastembed | 781 |
7 | NeumAI | 779 |
8 | awesome-local-global-descriptor | 637 |
9 | memorizing-transformers-pytorch | 609 |
10 | searchGPT | 570 |
11 | raptor | 450 |
12 | cherche | 313 |
13 | indexify | 238 |
14 | fastembed-rs | 150 |
15 | ACT | 126 |
16 | MoTIS | 115 |
17 | icl-ceil | 81 |
18 | original-demo-ui | 70 |
19 | retomaton | 64 |
20 | SeekStorm | 43 |
21 | BuRR | 34 |
22 | ragswift | 33 |
23 | SHREC2023-ANIMAR | 6 |