Top 23 Retrieval Open-Source Projects
-
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
-
NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
-
memorizing-transformers-pytorch
Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch
-
searchGPT
Grounded search engine (i.e. with source reference) based on LLM / ChatGPT / OpenAI API. It supports web search, file content search etc.
-
raptor
The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
-
indexify
A scalable realtime and continuous indexing and structured extraction engine for Unstructured Data to build Generative AI Applications
-
ACT
Atmospheric data Community Toolkit - A python based toolkit for exploring and analyzing time series atmospheric datasets (by ARM-DOE)
-
MoTIS
[NAACL 2022] Mobile text-to-image search powered by multimodal semantic representation models (e.g., OpenAI's CLIP) (by DRSY)
-
retomaton
PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)
-
ragswift
🚀 Scale your RAG pipeline using Ragswift: A scalable centralized embeddings management platform
-
SHREC2023-ANIMAR
Source codes of team TikTorch (1st place solution) for track 2 and 3 of the SHREC2023 Challenge
RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries
- Content vectors can differ vastly from question vectors. To bridge this gap you can use hypothetical embeddings: generate artificial questions for each chunk and store their embeddings
- Instead of saving just one embedding per text chunk, you should store several items per chunk (the text chunk itself, hypothetical question embeddings, metadata)
- RAG will fail miserably on requests like "summarize the whole document"
- To my knowledge, OpenAI embeddings don't perform particularly well; use an embedding model that is optimized for question answering or information retrieval and supports multiple languages. Also look into Instructor embeddings, and check the MTEB benchmark for comparisons: https://github.com/embeddings-benchmark/mteb
1 https://github.com/underlines/awesome-marketing-datascience/...
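The hypothetical-question idea from the list above can be sketched in a few lines. This is an illustrative toy, not any particular project's implementation: the embedding function is a stand-in hash trick (a real pipeline would use a QA-tuned embedding model), and the questions are hand-written where a real system would generate them with an LLM.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 64) -> list[float]:
    # Stand-in for a real embedding model: hash character trigrams
    # into a fixed-size normalized vector.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

class HypotheticalIndex:
    """Stores several embeddings per chunk: the chunk text itself plus
    hypothetical questions (hand-written here; normally LLM-generated)."""

    def __init__(self):
        self.entries = []  # (embedding, chunk_id)
        self.chunks = {}

    def add(self, chunk_id: str, text: str, questions: list[str]):
        self.chunks[chunk_id] = text
        for variant in [text] + questions:
            self.entries.append((toy_embed(variant), chunk_id))

    def query(self, question: str) -> str:
        q = toy_embed(question)
        best = max(self.entries, key=lambda e: cosine(q, e[0]))
        return self.chunks[best[1]]

index = HypotheticalIndex()
index.add("c1", "The invoice total is due within 30 days of receipt.",
          ["When is the invoice due?", "What is the payment deadline?"])
index.add("c2", "Support tickets are answered within one business day.",
          ["How fast is support?", "When will my ticket be answered?"])

print(index.query("What is the payment deadline?"))
```

Because the question embedding matches a stored hypothetical question rather than the chunk text, the question/content vector mismatch described above is sidestepped.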
The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
Project mention: Show HN: R2R – Open-source framework for production-grade RAG | news.ycombinator.com | 2024-02-26
Project mention: FastLLM by Qdrant – lightweight LLM tailored For RAG | news.ycombinator.com | 2024-04-01
Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21
Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4: it asks GPT for the Python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...
At one point I experimented a little with transformers that had access to external memory searchable via KNN lookups https://github.com/lucidrains/memorizing-transformers-pytorc... or via routed queries with https://github.com/glassroom/heinsen_routing . Both approaches seemed to work for me, but I had to put that work on hold for reasons outside my control.
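The external-memory idea can be sketched as a plain k-NN lookup over stored key/value pairs. This is a toy with random data, not the paper's implementation (which retrieves over cached attention keys with approximate nearest neighbors); all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# External memory: keys index past states, values are what gets
# mixed back in after retrieval. Shapes: (num_memories, dim).
dim, num_memories = 16, 1000
memory_keys = rng.standard_normal((num_memories, dim))
memory_values = rng.standard_normal((num_memories, dim))

def knn_memory_lookup(query: np.ndarray, k: int = 4) -> np.ndarray:
    """Return the mean of the k memory values whose keys are nearest
    to the query (exact search; large memories would use ANN)."""
    dists = np.linalg.norm(memory_keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return memory_values[nearest].mean(axis=0)

query = rng.standard_normal(dim)
retrieved = knn_memory_lookup(query)
print(retrieved.shape)  # (16,)
```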
Project mention: RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation | news.ycombinator.com | 2024-04-30
Worth a comparison with RAPTOR, another tiered RAG system.
https://arxiv.org/abs/2401.18059
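RAPTOR's core move, recursively grouping chunks and summarizing each group into a new layer, can be sketched roughly as below. The `summarize` function here just concatenates and truncates; the actual system uses an LLM for summaries and soft clustering rather than fixed-size groups.

```python
def summarize(texts: list[str], max_len: int = 200) -> str:
    # Placeholder for an LLM-generated abstractive summary.
    return " ".join(texts)[:max_len]

def build_raptor_tree(chunks: list[str], group_size: int = 2) -> list[list[str]]:
    """Build layers bottom-up: each layer groups the previous layer's
    nodes and replaces each group with a summary node. Retrieval then
    searches across all layers, so both fine detail and document-level
    overviews (e.g. 'summarize the whole document') are reachable."""
    layers = [chunks]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        groups = [prev[i:i + group_size] for i in range(0, len(prev), group_size)]
        layers.append([summarize(g) for g in groups])
    return layers

chunks = ["Alpha details.", "Beta details.", "Gamma details.", "Delta details."]
tree = build_raptor_tree(chunks)
print(len(tree))  # 3 layers: 4 leaves -> 2 summaries -> 1 root
```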
If you are interested, you can check out the documentation here: https://github.com/raphaelsty/cherche
Around two weeks ago, someone opened an issue on OasysDB asking to integrate it into their platform, Indexify, an open-source platform that extracts and processes unstructured data from different sources for generative AI apps in real time.
Project mention: Embeddings are a good starting point for the AI curious app developer | news.ycombinator.com | 2024-04-17
Yes, I use fastembed-rs[1] in a project I'm working on and it runs flawlessly. You can store the embeddings in any boring database, but for fast vector math, a vector database is recommended (e.g. the pgvector postgres extension).
[1] https://github.com/Anush008/fastembed-rs
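The "boring database" route can be as simple as SQLite with embeddings serialized as blobs and brute-force cosine similarity in application code, which is fine for small collections; pgvector or a dedicated vector database takes over when that gets slow. Everything below is illustrative, with dummy vectors standing in for model output.

```python
import sqlite3
import struct
import math

def pack(vec):
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / ((na * nb) or 1.0)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")

# Embeddings would come from a model (e.g. via fastembed); dummies here.
rows = [("cats are great", [1.0, 0.0, 0.2]),
        ("rust is fast", [0.0, 1.0, 0.1])]
for text, vec in rows:
    db.execute("INSERT INTO docs (text, embedding) VALUES (?, ?)",
               (text, pack(vec)))

def search(query_vec, k=1):
    # Brute-force scan: score every row, return the top-k texts.
    scored = [(cosine(query_vec, unpack(emb)), text)
              for text, emb in db.execute("SELECT text, embedding FROM docs")]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

print(search([0.9, 0.1, 0.2]))  # ['cats are great']
```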
Project mention: Show HN: Axilla – Open-source TypeScript framework for LLM apps | news.ycombinator.com | 2023-08-07
Hi HN, we are Nick and Ben, creators of Axilla.
Axilla is an open source TypeScript framework to develop LLM applications.
It’s in the early stages but you can use it today: we’ve already published 2 modules and have more coming soon!
Ben and I met while working at Cruise on the ML platform for self-driving cars. We spent many years there and learned the hard way that shipping AI is not quite the same as shipping regular code. There are many parts of the ML lifecycle, e.g., mining, processing, and labeling data and training, evaluating, and deploying models. Although none of them are rocket science, most of the inefficiencies tend to come from integrating them together. At Cruise, we built an integrated framework that accelerated the speed of shipping models to the car by 80%.
With the explosion of generative AI, we are seeing software teams building applications and features with the same inefficiencies we experienced at Cruise.
This got us excited about building an opinionated, end-to-end platform. We started building in Python but quickly noticed that most of the teams we talked to weren't using Python, but instead building in TypeScript. This is because most teams are not training their own models, but rather using foundational ones served by third parties over HTTP, like OpenAI, Anthropic, or even open-source ones from Hugging Face.
Because of this, we've decided to build Axilla as a TypeScript-first library.
Our goal is to build a modular framework that can be adopted incrementally yet benefits from full integration. For example, the production responses coming from the LLM should be able to be sent — with all necessary metadata — to the eval module or the labeling tooling.
So far, we’ve shipped 2 modules, that are available to use today on npm:
* *axgen*: focused on RAG-type workflows. Useful if you want to ingest data, get the embeddings, store them in a vector store, and then do similarity-search retrieval. It's how you give LLMs memory or more context about private data sources.
* *axeval*: a lightweight evaluation library that feels like Jest (so, like unit tests). In our experience, evaluation should be really easy to set up, to encourage continuous quality monitoring and slowly build ground-truth datasets of edge cases that can be used for regression testing and fine-tuning.
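The "evaluation as unit tests" idea translates to any language. A minimal Python analogue might look like the following; the names and scorers here are made up for illustration, not axeval's actual (TypeScript) API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str
    scorer: Callable[[str, str], float]

def contains(expected: str, actual: str) -> float:
    # Crude scorer: did the expected answer appear in the output?
    return 1.0 if expected.lower() in actual.lower() else 0.0

def run_suite(cases: list[EvalCase], model: Callable[[str], str]) -> float:
    """Run every case through the model and average the scores,
    like a test suite that reports a pass rate instead of pass/fail."""
    scores = [case.scorer(case.expected, model(case.prompt)) for case in cases]
    return sum(scores) / len(scores)

# A stub "model"; in practice this would call an LLM.
def fake_model(prompt: str) -> str:
    return "The capital of France is Paris." if "France" in prompt else "I don't know."

cases = [
    EvalCase("What is the capital of France?", "Paris", contains),
    EvalCase("What is 2 + 2?", "4", contains),
]
print(run_suite(cases, fake_model))  # 0.5
```

Running such a suite on every change gives a regression signal long before a full ground-truth dataset exists.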
We are working on a serving module and a data processing one next and would love to hear what functionality you need us to prioritize!
We built an open-source demo UI for you to discover the framework more: https://github.com/axilla-io/demo-ui
And here's a video of Nicholas walking through the UI that gives an idea of what axgen can do: https://www.loom.com/share/458f9b6679b740f0a5c78a33fffee3dc
We’d love to hear your feedback on the framework, you can let us know here, create an issue on the GitHub repo or send me an email at [email protected]
And of course, contributions welcome!
Project mention: SeekStorm VS tantivy - a user suggested alternative | libhunt.com/r/SeekStorm | 2024-03-22
Project mention: Show HN: Ragswift – Scalable embeddings platform powered by distributed compute | news.ycombinator.com | 2024-01-22
Retrieval related posts
-
How I got my first Rust job by doing open-source
-
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
-
FastLLM by Qdrant – lightweight LLM tailored For RAG
-
Indexify - Scalable, realtime, continuous indexing engine for unstructured data to AI
-
What are Vector Embeddings?
-
[D] Any pre trained retrieval based language models available?
-
[D] Is there an open-source implementation of the Retrieval-Enhanced Transformer (RETRO)?
-
Index
What are some of the best open-source Retrieval projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Apache Lucene | 2,147 |
2 | mteb | 1,395 |
3 | beir | 1,388 |
4 | R2R | 1,202 |
5 | RETRO-pytorch | 827 |
6 | fastembed | 781 |
7 | NeumAI | 779 |
8 | awesome-local-global-descriptor | 637 |
9 | memorizing-transformers-pytorch | 609 |
10 | searchGPT | 570 |
11 | raptor | 450 |
12 | cherche | 313 |
13 | indexify | 238 |
14 | fastembed-rs | 150 |
15 | ACT | 126 |
16 | MoTIS | 115 |
17 | icl-ceil | 81 |
18 | original-demo-ui | 70 |
19 | retomaton | 64 |
20 | SeekStorm | 43 |
21 | BuRR | 34 |
22 | ragswift | 33 |
23 | SHREC2023-ANIMAR | 6 |