Retrieval

Top 23 Retrieval Open-Source Projects

  • Apache Lucene

    Apache Lucene.NET

  • mteb

    MTEB: Massive Text Embedding Benchmark

  • Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

    RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:

    - Chunking can interfere with context boundaries.

    - Content vectors can differ vastly from question vectors. To address this, you have to use hypothetical embeddings (they generate artificial questions and store them).

    - Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical embedding questions, metadata).

    - RAG will fail miserably with requests like "summarize the whole document".

    - To my knowledge, OpenAI embeddings don't perform well; use an embedding model that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb

    [1] https://github.com/underlines/awesome-marketing-datascience/...
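One way to picture the "several embeddings per chunk" idea from the comment above is the minimal sketch below. It is an illustration under stated assumptions, not the commenter's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `MultiVectorIndex`, the chunk ids, and the sample texts are hypothetical names invented for this example. Each chunk is indexed both by its own text and by artificial questions it answers, and every hit resolves back to the parent chunk.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumption standing in for a real embedding model."""
    cleaned = text.lower().replace("?", " ").replace(".", " ")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MultiVectorIndex:
    """Hypothetical index that stores several vectors per chunk."""

    def __init__(self):
        self.chunks = {}   # chunk_id -> (text, metadata)
        self.entries = []  # (vector, chunk_id, kind)

    def add_chunk(self, chunk_id, text, questions, metadata=None):
        self.chunks[chunk_id] = (text, metadata or {})
        # One vector for the chunk text itself...
        self.entries.append((embed(text), chunk_id, "chunk"))
        # ...and one per artificial question the chunk is expected to answer.
        for question in questions:
            self.entries.append((embed(question), chunk_id, "question"))

    def search(self, query, k=1):
        query_vec = embed(query)
        best = {}  # chunk_id -> best score over all of its vectors
        for vec, chunk_id, _kind in self.entries:
            score = cosine(query_vec, vec)
            if score > best.get(chunk_id, -1.0):
                best[chunk_id] = score
        ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
        return [(cid, score, self.chunks[cid][0]) for cid, score in ranked[:k]]


index = MultiVectorIndex()
index.add_chunk("c1", "Invoices are archived after ninety days.",
                ["How long are invoices kept?"], {"source": "billing.md"})
index.add_chunk("c2", "Passwords must contain twelve characters.",
                ["How long should a password be?"], {"source": "security.md"})
results = index.search("How long are invoices kept?")
print(results)
```

In this toy setup the user's question shares few words with the chunk text but matches the stored question closely, which is the gap between content vectors and question vectors that the comment describes.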

  • beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

  • Project mention: On building a semantic search engine | news.ycombinator.com | 2024-01-06

    The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard

  • R2R

    The framework for fast development and deployment of RAG systems. (by SciPhi-AI)

  • Project mention: Show HN: R2R – Open-source framework for production-grade RAG | news.ycombinator.com | 2024-02-26
  • RETRO-pytorch

    Implementation of RETRO, DeepMind's retrieval-based attention net, in PyTorch

  • fastembed

    Fast, accurate, lightweight Python library for generating state-of-the-art embeddings

  • Project mention: FastLLM by Qdrant – lightweight LLM tailored For RAG | news.ycombinator.com | 2024-04-01
  • NeumAI

    Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

  • Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21

    Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. It asks GPT for the Python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...

  • awesome-local-global-descriptor

    My personal notes on local and global descriptors

  • memorizing-transformers-pytorch

    Implementation of Memorizing Transformers (ICLR 2022), an attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in PyTorch

  • Project mention: What can LLMs never do? | news.ycombinator.com | 2024-04-27

    At one point I experimented a little with transformers that had access to external memory searchable via KNN lookups https://github.com/lucidrains/memorizing-transformers-pytorc... or via routed queries with https://github.com/glassroom/heinsen_routing . Both approaches seemed to work for me, but I had to put that work on hold for reasons outside my control.
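The "external memory searchable via KNN lookups" idea in the comment above can be sketched roughly as follows. This is a simplified illustration, not code from either linked repository: `KNNMemory` is a hypothetical class, the search is exact brute force rather than approximate nearest neighbors, and the softmax-weighted average is just one plausible way to combine the retrieved values.

```python
import math


class KNNMemory:
    """Hypothetical key-value memory queried by exact (brute-force) kNN."""

    def __init__(self):
        self.keys = []
        self.values = []

    def add(self, key, value):
        self.keys.append(list(key))
        self.values.append(list(value))

    def lookup(self, query, k=2):
        if not self.keys:
            raise ValueError("memory is empty")
        # Dot-product score of the query against every stored key.
        scores = [sum(q * x for q, x in zip(query, key)) for key in self.keys]
        # Indices of the k highest-scoring keys, best first.
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        # Softmax over the retrieved scores (shifted by the max for stability).
        peak = max(scores[i] for i in top)
        weights = [math.exp(scores[i] - peak) for i in top]
        total = sum(weights)
        # Weighted average of the retrieved values.
        dim = len(self.values[0])
        out = [0.0] * dim
        for weight, i in zip(weights, top):
            for d in range(dim):
                out[d] += (weight / total) * self.values[i][d]
        return top, out


memory = KNNMemory()
memory.add([1.0, 0.0], [1.0, 0.0])
memory.add([0.0, 1.0], [0.0, 1.0])
memory.add([1.0, 1.0], [0.5, 0.5])
top, blended = memory.lookup([2.0, 1.0], k=2)
print(top, blended)
```

Scanning every key costs time linear in the memory size, which is why implementations at scale typically swap the exact search for an approximate nearest-neighbor index.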

  • searchGPT

    Grounded search engine (i.e. with source reference) based on LLM / ChatGPT / OpenAI API. It supports web search, file content search etc.

  • raptor

    The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

  • Project mention: RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation | news.ycombinator.com | 2024-04-30

    Worth a comparison with RAPTOR, another tiered RAG system.

    https://arxiv.org/abs/2401.18059

  • cherche

    Neural Search

  • Project mention: [P] Semantic search | /r/MachineLearning | 2023-05-08

    If you are interested, you can check out the documentation here: https://github.com/raphaelsty/cherche

  • indexify

    A scalable realtime and continuous indexing and structured extraction engine for Unstructured Data to build Generative AI Applications

  • Project mention: How I got my first Rust job by doing open-source | dev.to | 2024-04-30

    About two weeks ago, someone opened an issue on OasysDB to integrate it into his platform, Indexify, an open-source platform that extracts and processes unstructured data from different sources in real time for generative AI apps.

  • fastembed-rs

    Library to generate vector embeddings. Rust implementation of Qdrant's FastEmbed.

  • Project mention: Embeddings are a good starting point for the AI curious app developer | news.ycombinator.com | 2024-04-17

    Yes, I use fastembed-rs[1] in a project I'm working on and it runs flawlessly. You can store the embeddings in any boring database, but for fast vector math, a vector database is recommended (e.g. the pgvector postgres extension).

    [1] https://github.com/Anush008/fastembed-rs
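To illustrate the point about keeping embeddings in "any boring database", here is a minimal sketch that stores vectors as JSON text in SQLite and ranks them with a brute-force cosine scan. The table layout, the function names (`open_store`, `add_doc`, `search`), and the three-dimensional sample vectors are assumptions made for this example; real embeddings would come from a model such as those fastembed-rs wraps.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def open_store(path=":memory:"):
    """Open (or create) a SQLite store; the schema is an assumption for this sketch."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    return conn


def add_doc(conn, text, embedding):
    # The vector is serialized as JSON text; a BLOB column would also work.
    conn.execute(
        "INSERT INTO docs (text, embedding) VALUES (?, ?)",
        (text, json.dumps(embedding)),
    )
    conn.commit()


def search(conn, query_embedding, k=3):
    # Full scan: load every row, score it in Python, and keep the top k.
    rows = conn.execute("SELECT text, embedding FROM docs").fetchall()
    scored = [(cosine(query_embedding, json.loads(emb)), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


conn = open_store()
add_doc(conn, "rust", [1.0, 0.0, 0.0])
add_doc(conn, "python", [0.0, 1.0, 0.0])
add_doc(conn, "cargo", [0.9, 0.1, 0.0])
hits = search(conn, [1.0, 0.0, 0.0], k=2)
print(hits)
```

The scan touches every stored row on each query, so its cost grows linearly with the collection. That is the workload a dedicated vector index, such as the pgvector extension mentioned in the comment, is meant to speed up.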

  • ACT

    Atmospheric data Community Toolkit - A Python-based toolkit for exploring and analyzing time series atmospheric datasets (by ARM-DOE)

  • MoTIS

    [NAACL 2022] Mobile Text-to-Image search powered by multimodal semantic representation models (e.g., OpenAI's CLIP) (by DRSY)

  • icl-ceil

    [ICML 2023] Code for our paper “Compositional Exemplars for In-context Learning”.

  • original-demo-ui

    Demo UI for the axgen library

  • Project mention: Show HN: Axilla – Open-source TypeScript framework for LLM apps | news.ycombinator.com | 2023-08-07

    Hi HN, we are Nick and Ben, creators of Axilla.

    Axilla is an open source TypeScript framework to develop LLM applications.

    It’s in the early stages but you can use it today: we’ve already published 2 modules and have more coming soon!

    Ben and I met while working at Cruise on the ML platform for self-driving cars. We spent many years there and learned the hard way that shipping AI is not quite the same as shipping regular code. There are many parts of the ML lifecycle, e.g., mining, processing, and labeling data and training, evaluating, and deploying models. Although none of them are rocket science, most of the inefficiencies tend to come from integrating them together. At Cruise, we built an integrated framework that accelerated the speed of shipping models to the car by 80%.

    With the explosion of generative AI, we are seeing software teams building applications and features with the same inefficiencies we experienced at Cruise.

    This got us excited about building an opinionated, end-to-end platform. We started building in Python but quickly noticed that most of the teams we talked to weren’t using Python and were building in TypeScript instead. This is because most teams are not training their own models, but rather using foundation models served by third parties over HTTP, like OpenAI, Anthropic, or even OSS ones from Hugging Face.

    Because of this, we’ve decided to build Axilla as a TypeScript first library.

    Our goal is to build a modular framework that can be adopted incrementally yet benefits from full integration. For example, the production responses coming from the LLM should be able to be sent — with all necessary metadata — to the eval module or the labeling tooling.

    So far, we’ve shipped two modules that are available to use today on npm:

    * *axgen*: focused on RAG-type workflows. Useful if you want to ingest data, get the embeddings, store them in a vector store, and then do similarity search retrieval. It’s how you give LLMs memory or more context about private data sources.

    * *axeval*: a lightweight evaluation library that feels like jest (so, like unit tests). In our experience, evaluation should be really easy to set up, to encourage continuous quality monitoring and to slowly build ground truth datasets of edge cases that can be used for regression testing and fine-tuning.

    We are working on a serving module and a data processing one next and would love to hear what functionality you need us to prioritize!

    We built an open-source demo UI for you to discover the framework more: https://github.com/axilla-io/demo-ui

    And here's a video of Nicholas walking through the UI that gives an idea of what axgen can do: https://www.loom.com/share/458f9b6679b740f0a5c78a33fffee3dc

    We’d love to hear your feedback on the framework. You can let us know here, create an issue on the GitHub repo, or send me an email at [email protected]

    And of course, contributions welcome!

  • retomaton

    PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)

  • SeekStorm

    SeekStorm - sub-millisecond full-text search library & multi-tenancy server in Rust

  • Project mention: SeekStorm VS tantivy - a user suggested alternative | libhunt.com/r/SeekStorm | 2024-03-22
  • BuRR

    Bumped Ribbon Retrieval and Approximate Membership Query (by lorenzhs)

  • ragswift

    🚀 Scale your RAG pipeline using Ragswift: A scalable centralized embeddings management platform

  • Project mention: Show HN: Ragswift – Scalable embeddings platform powered by distributed compute | news.ycombinator.com | 2024-01-22
  • SHREC2023-ANIMAR

    Source code of team TikTorch (1st place solution) for tracks 2 and 3 of the SHREC2023 Challenge

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Retrieval related posts

  • How I got my first Rust job by doing open-source

    3 projects | dev.to | 30 Apr 2024
  • RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

    1 project | news.ycombinator.com | 30 Apr 2024
  • FastLLM by Qdrant – lightweight LLM tailored For RAG

    1 project | news.ycombinator.com | 1 Apr 2024
  • Indexify -Scalable, realtime, continuous indexing engine–Unstructured Data to AI

    1 project | news.ycombinator.com | 6 Mar 2024
  • What are Vector Embeddings?

    1 project | dev.to | 7 Feb 2024
  • [D] Any pre trained retrieval based language models available?

    3 projects | /r/MachineLearning | 22 Oct 2022
  • [D] Is there an open-source implementation of the Retrieval-Enhanced Transformer (RETRO)?

    4 projects | /r/MachineLearning | 15 Jan 2022

Index

What are some of the best open-source Retrieval projects? This list will help you:

 #  Project                          Stars
 1  Apache Lucene                    2,147
 2  mteb                             1,395
 3  beir                             1,388
 4  R2R                              1,202
 5  RETRO-pytorch                      827
 6  fastembed                          781
 7  NeumAI                             779
 8  awesome-local-global-descriptor    637
 9  memorizing-transformers-pytorch    609
10  searchGPT                          570
11  raptor                             450
12  cherche                            313
13  indexify                           238
14  fastembed-rs                       150
15  ACT                                126
16  MoTIS                              115
17  icl-ceil                            81
18  original-demo-ui                    70
19  retomaton                           64
20  SeekStorm                           43
21  BuRR                                34
22  ragswift                            33
23  SHREC2023-ANIMAR                     6
