Top 23 Python semantic-search Projects

MindsDB

79 21,794 10.0 Python

The platform for customizing AI from enterprise data

Project mention: How to build your Developer Portfolio with MindsDB: The symbiotic relationship between developers and Opensource in 2024. | dev.to | 2024-05-23

Developers can look for issues to fix on MindsDB's GitHub Issues page. The issues carry labels that indicate what you can work on. Fixing bugs shows that you are a problem solver who can resolve issues. Companies value this ability because it affects the quality of their product and user experience.

haystack

55 14,279 9.9 Python

🔍 LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Project mention: Haystack DB – 10x faster than FAISS with binary embeddings by default | news.ycombinator.com | 2024-04-28

I was confused for a bit but there is no relation to https://haystack.deepset.ai/

khoj

50 11,317 9.9 Python

Your AI second brain. Get answers to your questions, whether they be online or in your own notes. Use online AI models (e.g. gpt4) or private, local LLMs (e.g. llama3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, Desktop app, Web or WhatsApp.

Project mention: Show HN: I made an app to use local AI as daily driver | news.ycombinator.com | 2024-02-27

There are already several RAG chat open source solutions available. Two that immediately come to mind are:
- Danswer: https://github.com/danswer-ai/danswer
- Khoj: https://github.com/khoj-ai/khoj

txtai

356 7,265 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Project mention: Show HN: FileKitty – Combine and label text files for LLM prompt contexts | news.ycombinator.com | 2024-05-01

GPTCache

43 6,595 7.7 Python

Semantic cache for LLMs. Fully integrated with LangChain and llama_index.

Project mention: Ask HN: What are the drawbacks of caching LLM responses? | news.ycombinator.com | 2024-03-15

Just found this: https://github.com/zilliztech/GPTCache which seems to address this idea/issue.
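
To make the idea of a semantic cache concrete, here is a minimal sketch of the general technique: cached responses are looked up by similarity between prompts rather than by exact string match. It does not use or mirror GPTCache's API. The `SemanticCache` class, the `embed` helper (a toy bag-of-words stand-in for a real embedding model) and the `0.8` threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache that returns a stored response for similar prompts."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cut-off; tune for a real model
        self.entries = []           # list of (prompt vector, cached response)

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

    def get(self, prompt):
        """Return the best cached response, or None if nothing is close enough."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.8)
cache.put("What is the capital of France?", "Paris")
hit = cache.get("what is the capital of france")   # same words, different casing
miss = cache.get("How tall is the Eiffel Tower?")  # unrelated prompt
print(hit, miss)
```

With a real embedding model, the threshold controls the trade-off between saved LLM calls and the risk of returning an answer to a question that only looks similar.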

marqo

115 4,248 9.3 Python

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

Project mention: AI Search That Understands the Way Your Customer's Think | news.ycombinator.com | 2024-05-28

llmware

10 4,142 9.9 Python

Unified framework for building enterprise RAG pipelines with small, specialized models

Project mention: Natural Language Queries for SQL using SLIM | dev.to | 2024-06-05

If you made it this far, thank you for taking the time to go through this topic with us. For more content like this, make sure to visit our page at https://dev.to/llmware. The source code for this example and many more like it are on our GitHub at https://github.com/llmware-ai/llmware. Lastly, join our Discord to interact with a growing community of AI enthusiasts of all levels of experience at https://discord.gg/fCztJQeV7J!

lancedb

6 3,312 9.8 Python

Developer-friendly, serverless vector database for AI applications. Easily add long-term memory to your LLM apps!

Project mention: Open-source Rust-based RAG | news.ycombinator.com | 2024-03-10

There are much better known examples, such as https://qdrant.tech/ and https://github.com/lancedb/lancedb

Top2Vec

13 2,866 6.2 Python

Top2Vec learns jointly embedded topic, document and word vectors.

Project mention: [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset? | /r/MachineLearning | 2023-10-28

I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups:

docarray

32 2,812 8.3 Python

Represent, send, store and search multimodal data

semantra

17 2,361 4.9 Python

Multi-tool for semantic search

Project mention: Semantra: Multi-Tool for Semantic Search | news.ycombinator.com | 2023-10-16

mteb

2 1,548 9.9 Python

MTEB: Massive Text Embedding Benchmark

Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfere with context boundaries
- Content vectors can differ vastly from question vectors; for this you have to use hypothetical embeddings (generate artificial questions and store them)
- Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical question embeddings, metadata)
- RAG will fail miserably with requests like "summarize the whole document"
- To my knowledge, OpenAI embeddings don't perform well; use an embedding that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
[1] https://github.com/underlines/awesome-marketing-datascience/...
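
The suggestion above to store several embeddings per chunk (the chunk text plus hypothetical questions) could be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's code: `embed` is a toy bag-of-words stand-in for a real embedding model, the questions are hand-written rather than generated by an LLM, and `build_index` and `search` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Store one vector for the chunk text plus one per hypothetical question."""
    index = []
    for chunk_id, chunk in enumerate(chunks):
        for text in [chunk["text"], *chunk["questions"]]:
            index.append((chunk_id, embed(text)))
    return index


def search(index, query, top_k=1):
    """Score each chunk by its best-matching vector; return the top chunk ids."""
    query_vec = embed(query)
    best = {}
    for chunk_id, vec in index:
        score = cosine(query_vec, vec)
        best[chunk_id] = max(best.get(chunk_id, 0.0), score)
    return sorted(best, key=best.get, reverse=True)[:top_k]


chunks = [
    {
        "text": "Rotate the API key from the account settings page every 90 days.",
        "questions": ["How do I change my API key?"],   # hand-written example
    },
    {
        "text": "Invoices are emailed on the first business day of each month.",
        "questions": ["When will I receive my bill?"],  # hand-written example
    },
]

index = build_index(chunks)
print(search(index, "when do I receive the bill"))
```

The billing query shares almost no words with the invoice chunk itself, but it matches the stored question closely, so the second chunk is returned. That is the gap between content vectors and question vectors that the comment describes.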

yt-fts

12 1,368 8.5 Python

YouTube Full Text Search - Search all of a YouTube channel from the command line

Project mention: Challenges with semantic search on transcribed audio files | news.ycombinator.com | 2023-12-27

I've been trying to solve a problem with implementing semantic search on my YouTube search engine yt-fts (https://github.com/NotJoeMartinez/yt-fts). I've managed to substantially speed up search results by storing subtitle embeddings in Chroma. But a bigger problem has been how to properly segment the text in a way that accounts for the duration and context of word embeddings while returning precise time stamps. This is a blog post exploring what I've tried so far.
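
One common way to approach the segmentation problem described above is to embed overlapping windows of consecutive subtitle segments while carrying the timestamps along. The sketch below is only an illustration of that idea and is not taken from yt-fts; the `window_subtitles` function, the segment dictionary layout and the default window size are assumptions.

```python
def window_subtitles(segments, window_size=3, stride=1):
    """Group consecutive subtitle segments into overlapping windows.

    Each window keeps the start time of its first segment and the end time
    of its last one, so a search hit can be mapped back to a timestamp.
    """
    windows = []
    for i in range(0, len(segments), stride):
        group = segments[i:i + window_size]
        windows.append({
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(s["text"] for s in group),
        })
        if i + window_size >= len(segments):
            break  # this window already reaches the final segment
    return windows


# Hypothetical subtitle segments with times in seconds.
segments = [
    {"start": 0.0, "end": 2.5, "text": "today we look at"},
    {"start": 2.5, "end": 5.0, "text": "vector databases and"},
    {"start": 5.0, "end": 7.5, "text": "how they store embeddings"},
    {"start": 7.5, "end": 10.0, "text": "for semantic search"},
]

windows = window_subtitles(segments)
for w in windows:
    print(f'{w["start"]:>5.1f}s to {w["end"]:>5.1f}s  {w["text"]}')
```

Larger windows give the embedding more context at the cost of coarser timestamps. A smaller stride increases overlap, which raises the number of vectors to store.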

uform

9 935 9.1 Python

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

Project mention: Recapping the AI, Machine Learning and Data Science Meetup - May 30, 2024 | dev.to | 2024-06-04

UForm: Pocket-Sized Multimodal AI for Content Understanding and Generation

primeqa

6 709 7.5 Python

The prime repository for state-of-the-art Multilingual Question Answering research and development.

Project mention: Ask HN: Which LLMs can run locally on most consumer computers | news.ycombinator.com | 2024-05-21

There is actually a specific approach of this concept for generating synthetic data for training called UDAPDR[0].
It or something like it could likely be applied to any form of generation including what you are describing.
[0] - https://github.com/primeqa/primeqa/tree/4ae1b456dbe9f75276fe...

neural-cherche

2 317 8.4 Python

Neural Search

Project mention: [P] Introducing Neural-Cherche: Enhance Document Retrieval with Advanced AI Models | /r/MachineLearning | 2023-11-19

I'm excited to share a tool I've developed called Neural-Cherche. Its main purpose is to transform a Sentence Transformer into a ColBERT model, which is currently at the forefront of information retrieval tools.

cherche

12 316 3.8 Python

Neural Search

CX_DB8

4 222 0.0 Python

a contextual, biasable, word-or-sentence-or-paragraph extractive summarizer powered by the latest in text embeddings (Bert, Universal Sentence Encoder, Flair)

Project mention: Ask HN: What have you built with LLMs? | news.ycombinator.com | 2024-02-05

I was working on this stuff before it was cool, so in the sense of the precursor to LLMs (and sometimes supporting LLMs still) I've built many things:
1. Games you can play with word2vec or related models (could be drop in replaced with sentence transformer). It's crazy that this is 5 years old now: https://github.com/Hellisotherpeople/Language-games
2. "Constrained Text Generation Studio" - A research project I wrote when I was trying to solve LLM's inability to follow syntactic, phonetic, or semantic constraints: https://github.com/Hellisotherpeople/Constrained-Text-Genera...
3. DebateKG - A bunch of "Semantic Knowledge Graphs" built on my pet debate evidence dataset (LLM backed embeddings indexes synchronized with a graphDB and a sqlDB via txtai). Can create compelling policy debate cases https://github.com/Hellisotherpeople/DebateKG
4. My failed attempt at a good extractive summarizer. My life work is dedicated to one day solving the problems I tried to fix with this project: https://github.com/Hellisotherpeople/CX_DB8

sycamore

1 186 9.7 Python

🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data. (by aryn-ai)

Project mention: Show HN: Sycamore – an LLM-powered semantic data preparation system for search | news.ycombinator.com | 2023-09-29

HyperTag

12 185 7.5 Python

HyperTag - Intuitive Knowledge Management WebApp & CLI for Humans using Deep Learning & Tags

bert-solr-search

2 162 2.4 Python

Search with BERT vectors in Solr, Elasticsearch, OpenSearch and GSI APU

semantic-search-app-template

2 114 3.8 Python

Tutorial and template for a semantic search app powered by the Atlas Embedding Database, Langchain, OpenAI and FastAPI

DocumentGPT

1 105 8.1 Python

DocumentGPT is a web application that allows you to chat over your research document using OpenAI's chat API and perform semantic search using vector databases. This tool provides a seamless interface for interacting with your research document, exploring search results, and engaging in a conversation with an AI chatbot.

Project mention: DocumentGPT with Agents | /r/StreamlitOfficial | 2023-07-07

Was really excited to get everything working! Check it out at: https://github.com/aju22/DocumentGPT

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python semantic-search related posts

What contributing to Open-source is, and what it isn't

1 project | news.ycombinator.com | 27 Apr 2024

Build knowledge graphs with LLM-driven entity extraction

1 project | dev.to | 21 Feb 2024

How to Build a Semantic Search Engine for Emojis

1 project | dev.to | 7 Feb 2024

Bootstrap or VC?

1 project | news.ycombinator.com | 5 Feb 2024

txtai: An embeddings database for semantic search, graph networks and RAG

1 project | news.ycombinator.com | 3 Feb 2024

Are we at peak vector database?

8 projects | news.ycombinator.com | 25 Jan 2024

Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?

12 projects | news.ycombinator.com | 24 Dec 2023

Index

What are some of the best open-source semantic-search projects in Python? This list will help you:

#	Project	Stars
1	MindsDB	21,794
2	haystack	14,279
3	khoj	11,317
4	txtai	7,265
5	GPTCache	6,595
6	marqo	4,248
7	llmware	4,142
8	lancedb	3,312
9	Top2Vec	2,866
10	docarray	2,812
11	semantra	2,361
12	mteb	1,548
13	yt-fts	1,368
14	uform	935
15	primeqa	709
16	neural-cherche	317
17	cherche	316
18	CX_DB8	222
19	sycamore	186
20	HyperTag	185
21	bert-solr-search	162
22	semantic-search-app-template	114
23	DocumentGPT	105