What Is a Vector Database

pgvector

Mentions: 77 | Stars: 9,067 | Activity: 9.7 | Language: C

Open-source vector similarity search for Postgres

If you want to play with a vector database and already use Postgres, there's pgvector [0]. It's pretty easy to spin up, and Supabase wrote a solid tutorial [1] (you don't need to run it on Supabase).
[0] https://github.com/pgvector/pgvector
[1] https://supabase.com/blog/openai-embeddings-postgres-vector
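
To make concrete what a vector similarity query computes, here is a minimal pure-Python sketch of exact nearest-neighbor search by Euclidean distance. It does not use pgvector or Postgres; the `items` data and the `l2_distance` and `nearest` helpers are made up for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(items, query, k=2):
    """Return the ids of the k stored vectors closest to `query`.

    This is an exact, brute-force scan: every stored vector is compared
    with the query, which is the work an index tries to avoid at scale.
    """
    ranked = sorted(items, key=lambda item: l2_distance(item[1], query))
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical toy data: (id, embedding) pairs.
items = [
    (1, [1.0, 2.0, 3.0]),
    (2, [4.0, 5.0, 6.0]),
    (3, [1.0, 1.0, 1.0]),
]

print(nearest(items, [3.0, 1.0, 2.0], k=2))
```

A vector database returns the same kind of ranked result, but typically through an index rather than a full scan.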

Milvus

Mentions: 104 | Stars: 26,645 | Activity: 10.0 | Language: Go

A cloud-native vector database, storage for next generation AI applications

Why not choose an open-source solution? Milvus is free: https://github.com/milvus-io/milvus

faiss

Mentions: 70 | Stars: 28,054 | Activity: 9.4 | Language: C++

A library for efficient similarity search and clustering of dense vectors.

With 4B vectors, you can look at methods like quantization and compression, both detailed here for Faiss: https://github.com/facebookresearch/faiss/wiki/Indexing-1G-v...
Elasticsearch uses HNSW. I'm not sure what options it has, but quantization or compression will help reduce disk storage requirements. Alternatively, you can apply a dimensionality reduction algorithm and store only its output in ES, or pick a model with fewer dimensions. For example, https://huggingface.co/sentence-transformers/all-MiniLM-L6-v... has only 384 dims vs. 768/1024/2048/4096.
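
The storage argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes raw float32 storage (4 bytes per dimension) with no index overhead, and uses a simple per-vector min/max 8-bit scalar quantizer written for illustration. It is not the Faiss or Elasticsearch implementation, and all function names are hypothetical.

```python
def raw_size_tb(num_vectors, dims, bytes_per_value=4):
    """Raw storage in terabytes (10**12 bytes), ignoring index overhead."""
    return num_vectors * dims * bytes_per_value / 10**12

def quantize_uint8(vector):
    """Map each float to an integer code in 0..255 using the vector's min/max."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant vectors
    codes = bytes(round((v - lo) / scale) for v in vector)
    return codes, lo, scale

def dequantize_uint8(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]

n = 4_000_000_000  # the 4B vectors from the comment above
for dims in (384, 768, 1024):
    print(dims, "dims:",
          raw_size_tb(n, dims), "TB as float32,",
          raw_size_tb(n, dims, bytes_per_value=1), "TB as 8-bit codes")

# Round-trip one small, made-up vector to see the quantization error.
original = [0.12, -0.40, 0.33, 0.05]
codes, lo, scale = quantize_uint8(original)
print(list(codes), dequantize_uint8(codes, lo, scale))
```

Under these assumptions, 4B vectors at 384 dimensions take about 6.1 TB as float32 and about 1.5 TB as 8-bit codes, and 768 dimensions doubles both figures. The 8-bit codes are lossy: each reconstructed value can differ from the original by up to half a quantization step.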

qdrant

Mentions: 139 | Stars: 17,839 | Activity: 9.9 | Language: Rust

Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/

Well, to work on the core of the Qdrant engine (https://github.com/qdrant/qdrant) you should have some database knowledge, but Rust skills are even more important. However, we also have other products, like the cloud platform (https://cloud.qdrant.io), where we are looking for different skills.

ann-benchmarks

Mentions: 51 | Stars: 4,568 | Activity: 8.1 | Language: Python

Benchmarks of approximate nearest neighbor libraries in Python

In the ANN benchmarks, Elastic sets the bottom bar, afaict: http://ann-benchmarks.com/

Coral

Mentions: 10 | Stars: 1,864 | Activity: 9.9 | Language: TypeScript

A better commenting experience from Vox Media (by coralproject)

The Coral Project [0] (the commenting platform used by the Washington Post, New York Times, and The Verge) uses an Apache 2.0 license [1], which doesn't seem to have prevented it from raking in big SaaS customers.
A lot of people worry about copy-cat services, but it's kind of rare that someone will be able to compete with you as the original in hosting your own service as well as you can. Especially when you consider support and maintenance requirements of a new product you aren't personally developing.
I could see copy-cat services being more of an issue in the late stage of a product though? When everyone knows lots about how to stand it up and use it?
[0] https://coralproject.net/

marqo

Mentions: 114 | Stars: 4,111 | Activity: 9.3 | Language: Python

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

If anyone is looking for a vector search engine, see https://github.com/marqo-ai/marqo. It has additional functionality that makes vector search much easier.

GPT4Memory

Mentions: 1 | Stars: 1 | Activity: 3.5 | Language: Assembly

In the quest for ultimate speed, I started developing a vector database in assembly using GPT-4 as a side project: https://github.com/jn2clark/GPT4Memory

hnsqlite

Mentions: 6 | Stars: 143 | Activity: 5.5 | Language: Python

hnsqlite integrates hnswlib and sqlite for simple text embedding search

After working through several projects that utilized local hnswlib and different databases for text and vector persistence, I integrated open source hnswlib with sqlite to create an embedded vector search engine that can easily scale up to millions of embeddings. For self-hosted situations of under 10M embeddings and less than insane throughput I think this combo is hard to beat.
https://github.com/jiggy-ai/hnsqlite
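
A rough sketch of the general pattern described above, not hnsqlite's actual API: text and embeddings persist in SQLite, and a search ranks stored vectors against a query. To stay within the standard library, it substitutes a brute-force cosine scan for the hnswlib index. The `docs` table and the `connect`, `add`, `cosine`, and `search` helpers are hypothetical, as are the toy embeddings.

```python
import json
import math
import sqlite3

def connect(path=":memory:"):
    """Open the database and create the (hypothetical) docs table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    return conn

def add(conn, text, embedding):
    """Store a text and its embedding (serialized as JSON); return the row id."""
    cur = conn.execute(
        "INSERT INTO docs (text, embedding) VALUES (?, ?)",
        (text, json.dumps(embedding)),
    )
    conn.commit()
    return cur.lastrowid

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(conn, query, k=3):
    """Brute-force scan standing in for an HNSW index lookup."""
    rows = conn.execute("SELECT text, embedding FROM docs").fetchall()
    scored = [(cosine(json.loads(emb), query), text) for text, emb in rows]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]

conn = connect()
add(conn, "cats purr", [0.9, 0.1, 0.0])
add(conn, "dogs bark", [0.1, 0.9, 0.0])
add(conn, "stock prices fell", [0.0, 0.1, 0.9])
print(search(conn, [1.0, 0.2, 0.0], k=2))
```

The full scan costs time proportional to the number of stored rows on every query. Replacing it with an approximate index such as HNSW is what lets this kind of setup reach the millions of embeddings mentioned in the comment.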

Victor

Mentions: 1 | Stars: 5 | Activity: 3.8 | Language: Go

What's our vector, Victor? Victor is a toy vector database written in Go. (by corlinp)

Also plugging my crappy vector database, which you probably shouldn't use for anything but a fun project. However, it can be set up and used in seconds: https://github.com/corlinp/Victor

lucene

Mentions: 11 | Stars: 2,344 | Activity: 9.8 | Language: Java

Apache Lucene open-source search software

Are they forking Lucene or somehow getting the Lucene devs to increase that limit? Because this PR has been open for over a year now: https://github.com/apache/lucene/issues/11507

Elasticsearch

Mentions: 91 | Stars: 67,531 | Activity: 10.0 | Language: Java

Free and Open, Distributed, RESTful Search Engine

No, they just did something in Elasticsearch to make their own FieldType: https://github.com/elastic/elasticsearch/pull/95257

chroma

Mentions: 32 | Stars: 12,189 | Activity: 9.7 | Language: Python

the AI-native open-source embedding database

Chroma runs on Windows, since I believe it's just a Python package: https://github.com/chroma-core/chroma

similarity-search-kit

Mentions: 4 | Stars: 259 | Activity: 6.5 | Language: Swift

🔎 SimilaritySearchKit is a Swift package providing on-device text embeddings and semantic search functionality for iOS and macOS applications.

How are you guys thinking about the embedding generation side of things? It seems like that part generally has a hefty compute cost before anything even gets into the index. I just open-sourced a Swift package to try to make that part as easy as possible; the example project exports directly to Pinecone. https://github.com/ZachNagengast/similarity-search-kit

cozo

Mentions: 29 | Stars: 3,081 | Activity: 8.4 | Language: Rust

A transactional, relational-graph-vector database that uses Datalog for query. The hippocampus for AI!

If anyone wants to try a FOSS vector-relational-graph hybrid database for more complicated workloads than simple vector search, here it is: https://github.com/cozodb/cozo/
About the integrated vector search: https://docs.cozodb.org/en/latest/releases/v0.6.html
It also does duplicate detection (Minhash-LSH) and full-text search within the query language itself: https://docs.cozodb.org/en/latest/releases/v0.7.html
HN discussion a few days ago: https://news.ycombinator.com/item?id=35641164
Disclaimer: I wrote it.

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Search nearest-neighbor-search approximate-nearest-neighbor-search search-engine Database
Post date: 5 May 2023
