Top 23 Python semantic-search Projects

MindsDB

78 21,223 10.0 Python

The platform for customizing AI from enterprise data

Project mention: What’s the Difference Between Fine-tuning, Retraining, and RAG? | dev.to | 2024-04-08

Check us out on GitHub.

haystack

54 13,633 9.9 Python

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Project mention: Release Radar • March 2024 Edition | dev.to | 2024-04-07

View on GitHub

txtai

354 6,953 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Project mention: Build knowledge graphs with LLM-driven entity extraction | dev.to | 2024-02-21

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

GPTCache

43 6,406 8.7 Python

Semantic cache for LLMs. Fully integrated with LangChain and llama_index.

Project mention: Ask HN: What are the drawbacks of caching LLM responses? | news.ycombinator.com | 2024-03-15

Just found this: https://github.com/zilliztech/GPTCache which seems to address this idea/issue.
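To make the idea of a semantic cache concrete, here is a minimal sketch in plain Python. It is not GPTCache's API: the names `embed`, `cosine` and `ToySemanticCache` are hypothetical, and the bag-of-words "embedding" is only a stand-in for a real embedding model. The point it illustrates is that a lookup matches on similarity above a threshold rather than on exact string equality.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySemanticCache:
    """Hypothetical cache that returns a stored response for similar prompts."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

    def get(self, prompt):
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Only count it as a hit when the closest entry is similar enough.
        return best_response if best_score >= self.threshold else None


cache = ToySemanticCache(threshold=0.8)
cache.put("what is the capital of france", "Paris")
hit = cache.get("what is the capital of france ?")   # near-duplicate prompt
miss = cache.get("how do I bake bread")              # unrelated prompt
```

The threshold is the main design choice: set it too low and unrelated prompts receive stale answers, set it too high and the cache rarely hits.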

khoj

50 4,786 9.9 Python

Your AI second brain. A copilot to get answers to your questions, whether they be from your own notes or from the internet. Use powerful, online (e.g. gpt4) or private, local (e.g. mistral) LLMs. Self-host locally or use our web app. Access from Obsidian, Emacs, Desktop app, Web or WhatsApp.

Project mention: Show HN: I made an app to use local AI as daily driver | news.ycombinator.com | 2024-02-27

There are already several RAG chat open source solutions available. Two that immediately come to mind are:
- Danswer: https://github.com/danswer-ai/danswer
- Khoj: https://github.com/khoj-ai/khoj

marqo

114 4,111 9.3 Python

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

Project mention: Are we at peak vector database? | news.ycombinator.com | 2024-01-25

We (Marqo) are doing a lot on 1 and 2. There is a huge amount to be done on the ML side of vector search and we are investing heavily in it. I think it has not quite sunk in that vector search systems are ML systems and everything that comes with that. I would love to chat about 1 and 2 so feel free to email me (email is in my profile). What we have done so far is here -> https://github.com/marqo-ai/marqo

llmware

9 3,086 9.8 Python

Providing enterprise-grade LLM-based development framework, tools, and fine-tuned models.

Project mention: More Agents Is All You Need: LLMs performance scales with the number of agents | news.ycombinator.com | 2024-04-06

I couldn't agree more. You should check out LLMWare's SLIM agents (https://github.com/llmware-ai/llmware/tree/main/examples/SLI...). It focuses on pretty much exactly this, chaining multiple local LLMs together.
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit incorrect) depending on what the model is intended for. The LLMWare team did a good two-part video on this as well (https://www.youtube.com/watch?v=7oMTGhSKuNY).
I think dedicated miniature LLMs are the way forward.
Disclaimer: not affiliated with them in any way, I just think it's a really cool project.

Top2Vec

13 2,839 7.0 Python

Top2Vec learns jointly embedded topic, document and word vectors.

Project mention: [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset? | /r/MachineLearning | 2023-10-28

I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups:

docarray

32 2,739 9.2 Python

Represent, send, store and search multimodal data

Project mention: DocArray – Represent, send, and store multimodal data for ML | news.ycombinator.com | 2023-04-27

mteb

2 1,372 9.1 Python

MTEB: Massive Text Embedding Benchmark

Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries.
- Content vectors can differ vastly from question vectors; for this you have to use hypothetical embeddings (they generate artificial questions and store them).
- Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical embedding questions, metadata).
- RAG will fail miserably with requests like "summarize the whole document".
- To my knowledge, OpenAI embeddings don't perform well; use an embedding that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
[1] https://github.com/underlines/awesome-marketing-datascience/...
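The suggestion above to store several embeddings per chunk (the chunk text plus artificial questions about it) can be sketched roughly as follows. This is an illustration under stated assumptions, not code from any of the listed projects: `embed` is a toy word-count stand-in for a real embedding model, `MultiVectorIndex` is a hypothetical name, and the questions are written by hand here, where a real pipeline would typically generate them with an LLM.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MultiVectorIndex:
    """Hypothetical index: several vectors can point at the same chunk."""

    def __init__(self):
        self.chunks = {}    # chunk_id -> chunk text
        self.vectors = []   # (embedding, chunk_id) pairs

    def add_chunk(self, chunk_id, text, questions=()):
        self.chunks[chunk_id] = text
        # One vector for the chunk itself ...
        self.vectors.append((embed(text), chunk_id))
        # ... and one per hypothetical question the chunk answers.
        for question in questions:
            self.vectors.append((embed(question), chunk_id))

    def search(self, query):
        if not self.vectors:
            return None
        query_vec = embed(query)
        _, chunk_id = max(self.vectors, key=lambda item: cosine(query_vec, item[0]))
        return self.chunks[chunk_id]


index = MultiVectorIndex()
index.add_chunk(
    "c1",
    "Refunds are processed within 14 days of the return arriving.",
    questions=["how long does a refund take"],
)
index.add_chunk(
    "c2",
    "Shipping is free for orders above 50 euros.",
    questions=["how much does shipping cost"],
)
answer = index.search("how long does a refund take")
```

In this toy setup the query shares no words with the refund chunk itself, so it is only found through the stored question, which is the gap between content vectors and question vectors that the comment describes.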

uform

8 865 8.2 Python

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

Project mention: CatLIP: Clip Vision Accuracy with 2.7x Faster Pre-Training on Web-Scale Data | news.ycombinator.com | 2024-04-25

Question: are there any good on-device-sized image embedding models?
I tried https://github.com/unum-cloud/uform, which I do like, especially since it also supports languages other than English. Any recommendations for other alternatives?

primeqa

5 698 8.8 Python

The prime repository for state-of-the-art Multilingual Question Answering research and development.

Project mention: State-of-the-Art Multilingual Question Answering | /r/aiengineer | 2023-07-10

cherche

12 311 4.4 Python

Neural Search

Project mention: [P] Semantic search | /r/MachineLearning | 2023-05-08

If you are interested, you can check out the documentation here: https://github.com/raphaelsty/cherche

neural-cherche

2 295 8.1 Python

Neural Search

Project mention: [P] Introducing Neural-Cherche: Enhance Document Retrieval with Advanced AI Models | /r/MachineLearning | 2023-11-19

I'm excited to share a tool I've developed called Neural-Cherche. Its main purpose is to transform a Sentence Transformer into a ColBERT model, which is currently at the forefront of information retrieval tools.

CX_DB8

4 222 0.0 Python

A contextual, biasable, word-or-sentence-or-paragraph extractive summarizer powered by the latest in text embeddings (BERT, Universal Sentence Encoder, Flair)

Project mention: Ask HN: What have you built with LLMs? | news.ycombinator.com | 2024-02-05

I was working on this stuff before it was cool, so in the sense of the precursor to LLMs (and sometimes still supporting LLMs) I've built many things:
1. Games you can play with word2vec or related models (these could be swapped for a sentence transformer as a drop-in replacement). It's crazy that this is 5 years old now: https://github.com/Hellisotherpeople/Language-games
2. "Constrained Text Generation Studio" - A research project I wrote when I was trying to solve LLMs' inability to follow syntactic, phonetic, or semantic constraints: https://github.com/Hellisotherpeople/Constrained-Text-Genera...
3. DebateKG - A bunch of "Semantic Knowledge Graphs" built on my pet debate evidence dataset (LLM-backed embedding indexes synchronized with a graph DB and a SQL DB via txtai). It can create compelling policy debate cases: https://github.com/Hellisotherpeople/DebateKG
4. My failed attempt at a good extractive summarizer. My life's work is dedicated to one day solving the problems I tried to fix with this project: https://github.com/Hellisotherpeople/CX_DB8

HyperTag

12 180 4.1 Python

HyperTag - Intuitive Knowledge Management WebApp & CLI for Humans using Deep Learning & Tags

bert-solr-search

2 160 2.4 Python

Search with BERT vectors in Solr, Elasticsearch, OpenSearch and GSI APU

sycamore

1 152 9.6 Python

🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data. (by aryn-ai)

Project mention: Show HN: Sycamore – an LLM-powered semantic data preparation system for search | news.ycombinator.com | 2023-09-29

semantic-search-app-template

2 109 3.8 Python

Tutorial and template for a semantic search app powered by the Atlas Embedding Database, Langchain, OpenAI and FastAPI

DocumentGPT

1 98 8.1 Python

DocumentGPT is a web application that allows you to chat over your research document using OpenAI's chat API and perform semantic search using vector databases. This tool provides a seamless interface for interacting with your research document, exploring search results, and engaging in a conversation with an AI chatbot.

Project mention: DocumentGPT with Agents | /r/StreamlitOfficial | 2023-07-07

Was really excited to get everything working! Check it out at: https://github.com/aju22/DocumentGPT

citrus

1 92 7.6 Python

(distributed) vector database (by 0xDebabrata)

Project mention: Created a smol vector database in my free time. Looking to provide a LangChain integration soon! | /r/LangChain | 2023-05-06

It supports all the basic features like creating an index, inserting vectors and searching through them. Here's the GitHub link if anyone's interested in going over it: https://github.com/0xDebabrata/citrus
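The three basic operations mentioned above (creating an index, inserting vectors, searching) can be illustrated with a brute-force sketch. This is not citrus's API or implementation: `ToyVectorIndex` and its method names are hypothetical, and real vector databases typically use approximate nearest-neighbour structures rather than the linear scan shown here.

```python
import math


class ToyVectorIndex:
    """Hypothetical in-memory index using exact (brute-force) search."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vectors = []

    def insert(self, item_id, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.ids.append(item_id)
        self.vectors.append(list(vector))

    def search(self, query, k=1):
        if len(query) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(query)}")
        # Score every stored vector by Euclidean distance to the query.
        scored = [
            (math.dist(query, vector), item_id)
            for item_id, vector in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[0])
        return [item_id for _, item_id in scored[:k]]


index = ToyVectorIndex(dim=2)
index.insert("a", [0.0, 0.0])
index.insert("b", [1.0, 1.0])
index.insert("c", [5.0, 5.0])
nearest = index.search([0.9, 0.9], k=2)
```

A linear scan costs one distance computation per stored vector on every query, which is fine for small collections and is the baseline that approximate indexes trade accuracy against.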

abstracts-search

1 66 6.0 Python

Semantic search engine indexing 95 million academic publications

Project mention: [P] abstracts-search: A semantic search engine indexing 95 million academic publications | /r/MachineLearning | 2023-05-15

I'm releasing the entire project as open code and open data. All ~600 lines of Python, 69 GB in embeddings, and raw faiss index can be found through https://github.com/colonelwatch/abstracts-search

NLP-Guide

2 64 3.5 Python

Natural Language Processing (NLP). Covering topics such as Tokenization, Part Of Speech tagging (POS), Machine translation, Named Entity Recognition (NER), Classification, and Sentiment analysis.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python semantic-search related posts

Build knowledge graphs with LLM-driven entity extraction
1 project | dev.to | 21 Feb 2024
How to Build a Semantic Search Engine for Emojis
1 project | dev.to | 7 Feb 2024
Bootstrap or VC?
1 project | news.ycombinator.com | 5 Feb 2024
txtai: An embeddings database for semantic search, graph networks and RAG
1 project | news.ycombinator.com | 3 Feb 2024
Are we at peak vector database?
8 projects | news.ycombinator.com | 25 Jan 2024
Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?
12 projects | news.ycombinator.com | 24 Dec 2023
Open source alternative to ChatGPT and ChatPDF-like AI tools
6 projects | news.ycombinator.com | 9 Dec 2023

Index

What are some of the best open-source semantic-search projects in Python? This list will help you:

#	Project	Stars
1	MindsDB	21,223
2	haystack	13,633
3	txtai	6,953
4	GPTCache	6,406
5	khoj	4,786
6	marqo	4,111
7	llmware	3,086
8	Top2Vec	2,839
9	docarray	2,739
10	mteb	1,372
11	uform	865
12	primeqa	698
13	cherche	311
14	neural-cherche	295
15	CX_DB8	222
16	HyperTag	180
17	bert-solr-search	160
18	sycamore	152
19	semantic-search-app-template	109
20	DocumentGPT	98
21	citrus	92
22	abstracts-search	66
23	NLP-Guide	64