Top 23 semantic-search Open-Source Projects

MindsDB

78 21,160 10.0 Python

The platform for customizing AI from enterprise data

Project mention: What’s the Difference Between Fine-tuning, Retraining, and RAG? | dev.to | 2024-04-08


Typesense

129 17,796 9.8 C++

Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences

Project mention: Website Search Hurts My Feelings | news.ycombinator.com | 2023-12-26

There are actually plenty of non-ES products that are way easier to integrate and tune (and get better results with less effort).
- Typesense (https://github.com/typesense/typesense)
- Algolia
- Google Programmable Search Engine (https://programmablesearchengine.google.com/about/)

haystack

54 13,564 9.9 Python

🔍 LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Project mention: Release Radar • March 2024 Edition | dev.to | 2024-04-07


Weaviate

76 9,436 10.0 Go

Weaviate is an open-source vector database that stores both objects and vectors, combining vector search and structured filtering with the fault tolerance and scalability of a cloud-native database.

Project mention: pgvecto.rs alternatives - qdrant and Weaviate | libhunt.com/r/pgvecto.rs | 2024-03-13

txtai

354 6,910 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Project mention: Build knowledge graphs with LLM-driven entity extraction | dev.to | 2024-02-21

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

GPTCache

43 6,387 8.7 Python

Semantic cache for LLMs. Fully integrated with LangChain and llama_index.

Project mention: Ask HN: What are the drawbacks of caching LLM responses? | news.ycombinator.com | 2024-03-15

Just found this: https://github.com/zilliztech/GPTCache which seems to address this idea/issue.
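
To make the idea of a semantic cache concrete, here is a minimal sketch in Python. It is not GPTCache's API: the `SemanticCache` class, the `embed` function, and the 0.8 threshold are all hypothetical, and the bag-of-words "embedding" is only a toy stand-in for a real embedding model. The point is the lookup rule: reuse a stored response when a new prompt is similar enough to a cached one.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache keyed by prompt similarity instead of exact text."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cutoff; tune for a real model
        self.entries = []  # list of (embedding, response) pairs

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

    def get(self, prompt):
        # Return the response of the most similar cached prompt,
        # or None when nothing clears the threshold.
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.8)
cache.put("what is the capital of France", "Paris")
hit = cache.get("what is the capital of france ?")  # near-duplicate prompt
miss = cache.get("how do I bake bread")             # unrelated prompt
print(hit, miss)
```

The threshold is the main design choice: set too low, the cache returns answers to questions that only look similar, which is one of the drawbacks the thread title asks about.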

khoj

50 4,760 9.9 Python

Your AI second brain. A copilot to get answers to your questions, whether they be from your own notes or from the internet. Use powerful, online (e.g. gpt4) or private, local (e.g. mistral) LLMs. Self-host locally or use our web app. Access from Obsidian, Emacs, Desktop app, Web or WhatsApp.

Project mention: Show HN: I made an app to use local AI as daily driver | news.ycombinator.com | 2024-02-27

There are already several RAG chat open source solutions available. Two that immediately come to mind are:
Danswer
https://github.com/danswer-ai/danswer
Khoj
https://github.com/khoj-ai/khoj

marqo

114 4,086 9.3 Python

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

Project mention: Are we at peak vector database? | news.ycombinator.com | 2024-01-25

We (Marqo) are doing a lot on 1 and 2. There is a huge amount to be done on the ML side of vector search and we are investing heavily in it. I think it has not quite sunk in that vector search systems are ML systems and everything that comes with that. I would love to chat about 1 and 2 so feel free to email me (email is in my profile). What we have done so far is here -> https://github.com/marqo-ai/marqo

llmware

9 3,056 9.8 Python

Providing enterprise-grade LLM-based development framework, tools, and fine-tuned models.

Project mention: More Agents Is All You Need: LLMs performance scales with the number of agents | news.ycombinator.com | 2024-04-06

I couldn't agree more. You should check out LLMWare's SLIM agents (https://github.com/llmware-ai/llmware/tree/main/examples/SLI...). It's focusing on pretty much exactly this and chaining multiple local LLMs together.
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit incorrect) depending on what the model is intended for. The LLMWare team did a good 2 part video on this here as well (https://www.youtube.com/watch?v=7oMTGhSKuNY)
I think dedicated miniature LLMs are the way forward.
Disclaimer - Not affiliated with them in any way, just think it's a really cool project.

databerry

35 2,857 9.9 TypeScript

The no-code platform for building custom LLM Agents

Project mention: Open-source platform to build custom ChatGPT Agents | /r/reactjs | 2023-06-17

Top2Vec

13 2,833 7.0 Python

Top2Vec learns jointly embedded topic, document and word vectors.

Project mention: [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset? | /r/MachineLearning | 2023-10-28

I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups:

docarray

32 2,730 9.2 Python

Represent, send, store and search multimodal data

Project mention: DocArray – Represent, send, and store multimodal data for ML | news.ycombinator.com | 2023-04-27

examples

4 2,396 9.4 Jupyter Notebook

Jupyter Notebooks to help you get hands-on with Pinecone vector databases (by pinecone-io)

Project mention: I’m working on making a ChatGPT app with long term memory | /r/ChatGPTCoding | 2023-04-24

clip-retrieval

11 2,115 7.9 Jupyter Notebook

Easily compute clip embeddings and build a clip retrieval system with them

Project mention: FLaNK AI for 11 March 2024 | dev.to | 2024-03-11

awesome-generative-ai

5 1,957 9.5 Jupyter Notebook

A curated list of Generative AI tools, works, models, and references (by filipecalegario)

Project mention: Generative AI – A curated list of Generative AI tools, works, models | news.ycombinator.com | 2023-07-14

usearch

20 1,611 9.8 C++

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍

Project mention: USearch SQLite Extensions for Vector and Text Search | news.ycombinator.com | 2024-02-22

mteb

2 1,314 9.1 Python

MTEB: Massive Text Embedding Benchmark

Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries
- Content vectors can differ vastly from question vectors; for this you have to use hypothetical embeddings (they generate artificial questions and store them)
- Instead of saving just one embedding per text chunk you should store several (text chunk, hypothetical embedding questions, metadata)
- RAG will fail miserably with requests like "summarize the whole document"
- To my knowledge, OpenAI embeddings aren't performing well; use an embedding that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
[1] https://github.com/underlines/awesome-marketing-datascience/...
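
Two of the points in the comment above, overlapping chunks and storing several embeddings per chunk, can be sketched as follows. This is an illustration under stated assumptions, not any listed project's API: `embed` is a toy bag-of-words stand-in for a real embedding model, `chunk_words`, `build_index`, and `search` are hypothetical helpers, and the "hypothetical questions" are hand-written here where a real pipeline would presumably generate them with an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk_words(text, size, overlap):
    # Word windows of `size` that share `overlap` words with their neighbor,
    # so a sentence cut at one boundary is intact in the next window.
    # Assumes size > overlap.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def build_index(chunks, questions_for):
    # Store several records per chunk: the chunk text itself plus each
    # hypothetical question, all pointing back to the same chunk id.
    index = []
    for chunk_id, chunk in enumerate(chunks):
        index.append((embed(chunk), chunk_id, "chunk"))
        for question in questions_for(chunk):
            index.append((embed(question), chunk_id, "question"))
    return index


def search(index, chunks, query):
    # Return the parent chunk of the best-matching record and the record kind.
    query_vec = embed(query)
    _, chunk_id, kind = max(index, key=lambda record: cosine(query_vec, record[0]))
    return chunks[chunk_id], kind


overlapping = chunk_words("a b c d e f g h i j", size=4, overlap=1)

chunks = [
    "Refunds are processed within five business days.",
    "Shipping is free on orders above fifty dollars.",
]
# Hand-written stand-ins for LLM-generated questions (assumption).
hypothetical = {
    chunks[0]: ["How long does it take to get my money back?"],
    chunks[1]: ["Do I have to pay for delivery?"],
}
index = build_index(chunks, lambda chunk: hypothetical[chunk])
result = search(index, chunks, "how long until I get my money back")
print(overlapping, result)
```

In this toy setup the query shares no words with either chunk, so it matches only through the stored question, which is the mismatch between content vectors and question vectors that the comment describes.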

kernel-memory

3 1,150 9.6 C#

Index and query any data using LLM and natural language, tracking sources and showing citations.

Project mention: Open source alternative to ChatGPT and ChatPDF-like AI tools | news.ycombinator.com | 2023-12-09

about #3 I’ll recommend https://github.com/microsoft/kernel-memory :)

uform

6 859 8.2 Python

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

Project mention: Show HN: UForm v2 Featuring Multimodal Matryoshka, Multimodal DPO, and ONNX | news.ycombinator.com | 2024-03-28

primeqa

5 696 8.8 Python

The prime repository for state-of-the-art Multilingual Question Answering research and development.

Project mention: State-of-the-Art Multilingual Question Answering | /r/aiengineer | 2023-07-10

miyagi

1 610 9.2 Jupyter Notebook

Sample to envision intelligent apps with Microsoft's Copilot stack for AI-infused product experiences.

Project mention: Project Miyagi – Financial Coach | news.ycombinator.com | 2023-05-09

elastiknn

1 352 8.7 Scala

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.

awesome-semantic-search

3 319 5.7

A curated list of awesome resources related to Semantic Search🔎 and Semantic Similarity tasks.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-08.

semantic-search related posts

Build knowledge graphs with LLM-driven entity extraction
1 project | dev.to | 21 Feb 2024
How to Build a Semantic Search Engine for Emojis
1 project | dev.to | 7 Feb 2024
Bootstrap or VC?
1 project | news.ycombinator.com | 5 Feb 2024
txtai: An embeddings database for semantic search, graph networks and RAG
1 project | news.ycombinator.com | 3 Feb 2024
Are we at peak vector database?
8 projects | news.ycombinator.com | 25 Jan 2024
Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?
12 projects | news.ycombinator.com | 24 Dec 2023
Open source alternative to ChatGPT and ChatPDF-like AI tools
6 projects | news.ycombinator.com | 9 Dec 2023

Index

What are some of the best open-source semantic-search projects? This list will help you:

#	Project	Stars
1	MindsDB	21,160
2	Typesense	17,796
3	haystack	13,564
4	Weaviate	9,436
5	txtai	6,910
6	GPTCache	6,387
7	khoj	4,760
8	marqo	4,086
9	llmware	3,056
10	databerry	2,857
11	Top2Vec	2,833
12	docarray	2,730
13	examples	2,396
14	clip-retrieval	2,115
15	awesome-generative-ai	1,957
16	usearch	1,611
17	mteb	1,314
18	kernel-memory	1,150
19	uform	859
20	primeqa	696
21	miyagi	610
22	elastiknn	352
23	awesome-semantic-search	319