Top 23 semantic-search Open-Source Projects
- Typesense: Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences
- haystack: 🔍 LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
- Weaviate: Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database.
- txtai: 💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
- khoj: Your AI second brain. A copilot to get answers to your questions, whether they be from your own notes or from the internet. Use powerful, online (e.g gpt4) or private, local (e.g mistral) LLMs. Self-host locally or use our web app. Access from Obsidian, Emacs, Desktop app, Web or Whatsapp.
- awesome-generative-ai: A curated list of Generative AI tools, works, models, and references (by filipecalegario)
- usearch: Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
- kernel-memory: Index and query any data using LLM and natural language, tracking sources and showing citations.
- uform: Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
- primeqa: The prime repository for state-of-the-art Multilingual Question Answering research and development.
- miyagi: Sample to envision intelligent apps with Microsoft's Copilot stack for AI-infused product experiences.
- elastiknn: Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
- awesome-semantic-search: A curated list of awesome resources related to Semantic Search 🔎 and Semantic Similarity tasks.
Project mention: What’s the Difference Between Fine-tuning, Retraining, and RAG? | dev.to | 2024-04-08
Check us out on GitHub.
There are actually plenty of non-ES products that are way easier to integrate and tune (and get better results with less effort).
- Typesense (https://github.com/typesense/typesense)
- Algolia
- Google Programmable Search Engine (https://programmablesearchengine.google.com/about/)
Project mention: pgvecto.rs alternatives - qdrant and Weaviate | libhunt.com/r/pgvecto.rs | 2024-03-13
txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.
Project mention: Ask HN: What are the drawbacks of caching LLM responses? | news.ycombinator.com | 2024-03-15
Just found this: https://github.com/zilliztech/GPTCache which seems to address this idea/issue.
Project mention: Show HN: I made an app to use local AI as daily driver | news.ycombinator.com | 2024-02-27
There are already several RAG chat open source solutions available. Two that immediately come to mind are:
Danswer
https://github.com/danswer-ai/danswer
Khoj
We (Marqo) are doing a lot on 1 and 2. There is a huge amount to be done on the ML side of vector search and we are investing heavily in it. I think it has not quite sunk in that vector search systems are ML systems and everything that comes with that. I would love to chat about 1 and 2 so feel free to email me (email is in my profile). What we have done so far is here -> https://github.com/marqo-ai/marqo
Project mention: More Agents Is All You Need: LLMs performance scales with the number of agents | news.ycombinator.com | 2024-04-06
I couldn't agree more. You should check out LLMWare's SLIM agents (https://github.com/llmware-ai/llmware/tree/main/examples/SLI...). It focuses on pretty much exactly this and on chaining multiple local LLMs together.
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit wrong), depending on what the model is intended for. The LLMWare team did a good two-part video on this as well (https://www.youtube.com/watch?v=7oMTGhSKuNY).
I think dedicated miniature LLMs are the way forward.
Disclaimer: not affiliated with them in any way, I just think it's a really cool project.
Project mention: [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset? | /r/MachineLearning | 2023-10-28
I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups:
Project mention: DocArray – Represent, send, and store multimodal data for ML | news.ycombinator.com | 2023-04-27
Project mention: I’m working on making a ChatGPT app with long term memory | /r/ChatGPTCoding | 2023-04-24
Project mention: Generative AI – A curated list of Generative AI tools, works, models | news.ycombinator.com | 2023-07-14
Project mention: USearch SQLite Extensions for Vector and Text Search | news.ycombinator.com | 2024-02-22
RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries.
- Content vectors can differ vastly from question vectors. To address this, use hypothetical embeddings: generate artificial questions for each chunk and store their embeddings.
- Instead of saving just one embedding per text chunk, store several (text chunk, hypothetical question embeddings, metadata).
- RAG will fail miserably with requests like "summarize the whole document".
- To my knowledge, OpenAI embeddings don't perform well. Use an embedding that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
[1] https://github.com/underlines/awesome-marketing-datascience/...
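The points above about chunk boundaries and storing several vectors per chunk can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the hypothetical questions are written by hand (in practice an LLM would generate them), and every function and field name here (`chunk_text`, `build_index`, `search`, `kind`, `meta`) is made up for the example and does not come from any of the listed projects.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping word windows, so content cut at one
    chunk boundary still appears whole in the neighbouring chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step) if words]


def embed(text):
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks, questions_for, source):
    """Store several vectors per chunk: one for the chunk text and one per
    hypothetical question. Each record carries the chunk text and metadata.
    questions_for maps a chunk index to a list of hand-written questions."""
    records = []
    for i, chunk in enumerate(chunks):
        meta = {"source": source, "chunk_id": i}
        records.append({"vector": embed(chunk), "kind": "chunk", "text": chunk, "meta": meta})
        for question in questions_for.get(i, []):
            records.append({"vector": embed(question), "kind": "question", "text": chunk, "meta": meta})
    return records


def search(records, query, k=1):
    """Return the best record for each of the top-k distinct chunks."""
    query_vector = embed(query)
    ranked = sorted(records, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    seen, hits = set(), []
    for record in ranked:
        chunk_id = record["meta"]["chunk_id"]
        if chunk_id not in seen:
            seen.add(chunk_id)
            hits.append(record)
        if len(hits) == k:
            break
    return hits


if __name__ == "__main__":
    doc = ("Invoices are emailed on the first day of each month. "
           "Refunds are processed within five business days of approval.")
    chunks = chunk_text(doc, size=10, overlap=0)
    index = build_index(chunks, {1: ["How long does a refund take to arrive?"]}, source="billing.txt")
    best = search(index, "How long does a refund take?")[0]
    print(best["kind"], best["meta"], best["text"])
```

In the demo, the user's question shares no tokens with the refund chunk itself ("refund" vs. "Refunds"), so the match comes through the stored hypothetical question, which points back to the same chunk text and metadata. A real embedding model would narrow that gap but, per the comment above, not necessarily close it.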
Project mention: Open source alternative to ChatGPT and ChatPDF-like AI tools | news.ycombinator.com | 2023-12-09
about #3 I’ll recommend https://github.com/microsoft/kernel-memory :)
Project mention: Show HN: UForm v2 Featuring Multimodal Matryoshka, Multimodal DPO, and ONNX | news.ycombinator.com | 2024-03-28
semantic-search related posts
- Build knowledge graphs with LLM-driven entity extraction
- How to Build a Semantic Search Engine for Emojis
- Bootstrap or VC?
- txtai: An embeddings database for semantic search, graph networks and RAG
- Are we at peak vector database?
- Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?
- Open source alternative to ChatGPT and ChatPDF-like AI tools
Index
What are some of the best open-source semantic-search projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | MindsDB | 21,160 |
2 | Typesense | 17,796 |
3 | haystack | 13,564 |
4 | Weaviate | 9,436 |
5 | txtai | 6,910 |
6 | GPTCache | 6,387 |
7 | khoj | 4,760 |
8 | marqo | 4,086 |
9 | llmware | 3,056 |
10 | databerry | 2,857 |
11 | Top2Vec | 2,833 |
12 | docarray | 2,730 |
13 | examples | 2,396 |
14 | clip-retrieval | 2,115 |
15 | awesome-generative-ai | 1,957 |
16 | usearch | 1,611 |
17 | mteb | 1,314 |
18 | kernel-memory | 1,150 |
19 | uform | 859 |
20 | primeqa | 696 |
21 | miyagi | 610 |
22 | elastiknn | 352 |
23 | awesome-semantic-search | 319 |