Top 23 Embedding Open-Source Projects
Each entry lists, in order: mentions, GitHub stars, activity score, and primary language.

supabase

767 65,869 10.0 TypeScript

The open source Firebase alternative.

Project mention: How to get free Postgres | dev.to | 2024-04-24

Sign up for Supabase: Head over to Supabase and sign up. Create a new workspace and project with your preferred names.

quivr

22 32,240 9.9 TypeScript

Your GenAI Second Brain 🧠 A personal productivity assistant (RAG) ⚡️🤖 Chat with your docs (PDF, CSV, ...) & apps using Langchain, GPT 3.5 / 4 turbo, Private, Anthropic, VertexAI, Ollama, LLMs, Groq that you can share with users ! Local & Private alternative to OpenAI GPTs & ChatGPT powered by retrieval-augmented generation.

Project mention: privateGPT VS quivr - a user suggested alternative | libhunt.com/r/privateGPT | 2024-01-12

chroma

32 12,189 9.7 Python

the AI-native open-source embedding database

Project mention: Let’s build AI-tools with the help of AI and Typescript! | dev.to | 2024-04-23

Package installer for Python (pip): we use this to install Python-based packages such as Jupyter Lab, and we will also use it to install other Python-based tools like the Chroma DB vector database.

h2ogpt

28 10,398 10.0 Python

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

Project mention: Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023? | news.ycombinator.com | 2023-12-24

As others have said you want RAG.
The most feature complete implementation I've seen is h2ogpt[0] (not affiliated).
The code is kind of a mess (most of the logic is in an ~8000 line python file) but it supports ingestion of everything from YouTube videos to docx, pdf, etc - either offline or from the web interface. It uses langchain and a ton of additional open source libraries under the hood. It can run directly on Linux, via docker, or with one-click installers for Mac and Windows.
It has various model hosting implementations built in - transformers, exllama, llama.cpp as well as support for model serving frameworks like vLLM, HF TGI, etc or just OpenAI.
You can also define your preferred embedding model along with various other parameters but I've found the out of box defaults to be pretty sane and usable.
[0] - https://github.com/h2oai/h2ogpt

txtai

354 6,953 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Project mention: Build knowledge graphs with LLM-driven entity extraction | dev.to | 2024-02-21

txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

pytorch-metric-learning

3 5,764 7.9 Python

The easiest way to use deep metric learning in your application. Modular, flexible, and extensible. Written in PyTorch.

generative-ai

1 5,396 9.7 Jupyter Notebook

Sample code and notebooks for Generative AI on Google Cloud (by GoogleCloudPlatform)

Project mention: Google Imagen 2 | news.ycombinator.com | 2023-12-13

I've used the code based on similar examples from GitHub [1]. According to docs [2], imagegeneration@005 was released on the 11th, so I guessed it's Imagen 2, though there are no confirmations.
[1] https://github.com/GoogleCloudPlatform/generative-ai/blob/ma...
[2] https://console.cloud.google.com/vertex-ai/publishers/google...

paradedb

16 3,803 9.8 Rust

Postgres for Search and Analytics

Project mention: Using ClickHouse to scale an events engine | news.ycombinator.com | 2024-04-11

hub

1 3,436 3.7 Python

A library for transfer learning by reusing parts of TensorFlow models. (by tensorflow)

lance

10 3,256 9.8 Rust

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.

Project mention: The Nimble File Format by Meta | news.ycombinator.com | 2024-04-25

llmware

9 3,127 9.8 Python

Providing enterprise-grade LLM-based development framework, tools, and fine-tuned models.

Project mention: More Agents Is All You Need: LLMs performance scales with the number of agents | news.ycombinator.com | 2024-04-06

I couldn't agree more. You should check out LLMWare's SLIM agents (https://github.com/llmware-ai/llmware/tree/main/examples/SLI...). It's focusing on pretty much exactly this and chaining multiple local LLMs together.
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit incorrect) depending on what the model is intended for. The LLMWare team did a good two-part video on this here as well (https://www.youtube.com/watch?v=7oMTGhSKuNY)
I think dedicated miniature LLMs are the way forward.
Disclaimer - Not affiliated with them in any way, just think it's a really cool project.

towhee

26 2,989 8.6 Python

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

Project mention: FLaNK Stack Weekly for 14 Aug 2023 | dev.to | 2023-08-14

lightly

16 2,741 9.0 Python

A python library for self-supervised learning on images.

Project mention: Show HN: Lightly – A Python library for self-supervised learning on images | news.ycombinator.com | 2023-11-16

ml-surveys

1 2,736 0.0

📋 Survey papers summarizing advances in deep learning, NLP, CV, graphs, reinforcement learning, recommendations, etc.

text-embeddings-inference

3 1,982 8.9 Rust

A blazing fast inference solution for text embeddings models

Project mention: HuggingFace text-generation-inference is reverting to Apache 2.0 License | news.ycombinator.com | 2024-04-08

Worth noting that this also impacts the great https://github.com/huggingface/text-embeddings-inference, which allows anyone to run state of the art embeddings with great performance.

awesome-generative-ai

5 1,971 9.5 Jupyter Notebook

A curated list of Generative AI tools, works, models, and references (by filipecalegario)

Project mention: Generative AI – A curated list of Generative AI tools, works, models | news.ycombinator.com | 2023-07-14

obsidian-smart-connections

25 1,837 9.6 JavaScript

Chat with your notes & see links to related content with AI embeddings. Use local models or 100+ via APIs like Claude, Gemini, ChatGPT & Llama 3

Project mention: Ask HN: How are you currently using AI (personally or professionally)? | news.ycombinator.com | 2023-07-26

For my personal notes, I use Smart Connections[1] with Obsidian. I am considering devising my own solution using LlamaIndex[2] in the near future.
For coding, I use Copilot[3]. While it's been great for writing boilerplate code, it falls short in every other regard. I also had the opportunity to try the new version of Copilot as well, but it feels like a glorified ChatGPT inside VSCode.
For everything else, I use a tiny tool I made[4] which enables me to invoke my own prompts in basically any application that allows me to select text.
[1] https://github.com/brianpetro/obsidian-smart-connections
[2] https://gpt-index.readthedocs.io/en/latest/getting_started/s...
[3] https://github.com/features/copilot
[4] https://github.com/overflowy/chat-key

GPTDiscord

4 1,780 9.2 Python

A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI-moderation, custom indexes/knowledgebase, youtube summarizer, and more!

Project mention: Full-environment code interpreter in discord (just like ChatGPT!) + Tons of other features like multi-modality chat, internet-connected chat, chatting with your documents, and more! | /r/SideProject | 2023-10-31

instructor-embedding

4 1,695 6.1 Python

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

Project mention: My experience on starting with fine tuning LLMs with custom data | /r/LocalLLaMA | 2023-07-10

If you like embeddings and vector DB, you should look into this: https://github.com/HKUNLP/instructor-embedding

featureform

28 1,674 9.7 Jupyter Notebook

The Virtual Feature Store. Turn your existing data infrastructure into a feature store.

Project mention: Still look familiar? | /r/u_featureform | 2023-07-13

magnitude

5 1,611 0.0 Python

A fast, efficient universal vector embedding utility package.

eda_nlp

1 1,536 0.0 Python

Data augmentation for NLP, presented at EMNLP 2019

contextualized-topic-models

7 1,157 5.0 Python

A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Embeddings-related posts

Let’s build AI-tools with the help of AI and Typescript!
5 projects | dev.to | 23 Apr 2024
The Illustrated Word2Vec
3 projects | news.ycombinator.com | 19 Apr 2024
Mixtral 8x22B
4 projects | news.ycombinator.com | 17 Apr 2024
Embeddings are a good starting point for the AI curious app developer
7 projects | news.ycombinator.com | 17 Apr 2024
HuggingFace text-generation-inference is reverting to Apache 2.0 License
2 projects | news.ycombinator.com | 8 Apr 2024
Show HN: Chromem-go – Embeddable vector database for Go
4 projects | news.ycombinator.com | 5 Apr 2024
FastLLM by Qdrant – lightweight LLM tailored For RAG
1 project | news.ycombinator.com | 1 Apr 2024

Index

What are some of the best open-source Embedding projects? This list will help you:

#	Project	Stars
1	supabase	65,869
2	quivr	32,240
3	chroma	12,189
4	h2ogpt	10,398
5	txtai	6,953
6	pytorch-metric-learning	5,764
7	generative-ai	5,396
8	paradedb	3,803
9	hub	3,436
10	lance	3,256
11	llmware	3,127
12	towhee	2,989
13	lightly	2,741
14	ml-surveys	2,736
15	text-embeddings-inference	1,982
16	awesome-generative-ai	1,971
17	obsidian-smart-connections	1,837
18	GPTDiscord	1,780
19	instructor-embedding	1,695
20	featureform	1,674
21	magnitude	1,611
22	eda_nlp	1,536
23	contextualized-topic-models	1,157