What Is a Vector Database

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • pgvector

    Open-source vector similarity search for Postgres

  • If you want to play with a vector database and already use Postgres, there's pgvector[0]. It's pretty easy to spin up, and Supabase wrote a solid tutorial[1] (you don't need to run it on Supabase).

    0 - https://github.com/pgvector/pgvector

    1 - https://supabase.com/blog/openai-embeddings-postgres-vector

  • Milvus

    A cloud-native vector database, storage for next generation AI applications

  • Why not choose an open-source solution? https://github.com/milvus-io/milvus is free!

  • faiss

    A library for efficient similarity search and clustering of dense vectors.

  • With 4B vectors, you can look at methods like quantization and compression, both detailed here for Faiss - https://github.com/facebookresearch/faiss/wiki/Indexing-1G-v...

    Elasticsearch uses HNSW. I'm not sure what options it has, but quantization/compression will help reduce disk storage requirements. Alternatively, you can look at dimensionality reduction algorithms and only store that output in ES. Or pick a model with a small number of dimensions. For example, https://huggingface.co/sentence-transformers/all-MiniLM-L6-v... has only 384 dims vs 768/1024/2048/4096.
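To make the storage argument in the comment above concrete, here is a rough Python sketch. It assumes vectors are stored as float32 (4 bytes per value) and ignores any index overhead such as HNSW graph links. The function names are hypothetical, and the 8-bit scalar quantizer is only a simple stand-in for the idea of compression; it is not the Faiss API, and Faiss's own quantizers are more sophisticated.

```python
def raw_storage_gib(num_vectors, dims, bytes_per_value=4):
    """Size of the raw vector payload in GiB, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value / 2**30


def quantize_8bit(vector):
    """Scalar-quantize floats to codes in [0, 255] using the vector's min/max."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant vectors
    codes = bytes(round((x - lo) / scale) for x in vector)
    return codes, lo, scale


def dequantize_8bit(codes, lo, scale):
    """Approximate reconstruction; each value is off by at most scale / 2."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    n = 4_000_000_000  # the 4B vectors mentioned in the comment
    for dims in (384, 768, 1024):
        full = raw_storage_gib(n, dims)
        small = raw_storage_gib(n, dims, bytes_per_value=1)
        print(f"{dims} dims: float32 {full:,.0f} GiB, 8-bit {small:,.0f} GiB")
```

Under these assumptions, both levers multiply: halving the dimension count halves the payload, and going from 4 bytes to 1 byte per value cuts it by a further factor of four, at the cost of some reconstruction error.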

  • qdrant

    Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/

  • Well, to work on the core of the Qdrant engine https://github.com/qdrant/qdrant you should have some DB knowledge, but even more important are Rust skills. However, we also have other products, like the cloud platform https://cloud.qdrant.io, where we are looking for different skills.

  • ann-benchmarks

    Benchmarks of approximate nearest neighbor libraries in Python

  • In the ANN benchmarks, Elastic sets the bottom bar, afaict. http://ann-benchmarks.com/

  • Coral

    A better commenting experience from Vox Media (by coralproject)

  • The Coral Project [0] (commenting platform used on Washington Post, New York Times, The Verge) uses an Apache 2.0 license [1], which doesn't seem to have prevented it from raking in big SaaS customers.

    A lot of people worry about copy-cat services, but it's fairly rare for someone else to host your service as well as you, the original developer, can. That's especially true when you consider the support and maintenance requirements of a new product you aren't personally developing.

    I could see copy-cat services being more of an issue in the late stage of a product though? When everyone knows lots about how to stand it up and use it?

    [0] https://coralproject.net/

  • marqo

    Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

  • If anyone is looking for a vector search engine, see https://github.com/marqo-ai/marqo. It has additional functionality that makes vector search much easier.

  • GPT4Memory

  • In the quest for ultimate speed, I started a side project: developing a vector database in assembly using GPT-4. https://github.com/jn2clark/GPT4Memory

  • hnsqlite

    hnsqlite integrates hnswlib and sqlite for simple text embedding search

  • After working through several projects that utilized local hnswlib and different databases for text and vector persistence, I integrated open source hnswlib with sqlite to create an embedded vector search engine that can easily scale up to millions of embeddings. For self-hosted situations of under 10M embeddings and less than insane throughput I think this combo is hard to beat.

    https://github.com/jiggy-ai/hnsqlite
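The general idea described above, keeping text and embeddings together in SQLite, can be sketched with the Python standard library alone. This is not hnsqlite's API: the `TinyStore` class and its methods are hypothetical, and the search is an exact linear scan by cosine similarity rather than an hnswlib HNSW index, so it only illustrates the storage side and would not scale to millions of embeddings.

```python
import math
import sqlite3
import struct


def pack(vector):
    """Serialize a sequence of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack(blob):
    """Inverse of pack(): float32 bytes back to a tuple of floats."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyStore:
    """Hypothetical helper: text and embeddings side by side in one table."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS docs ("
            "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding BLOB NOT NULL)"
        )

    def add(self, text, embedding):
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO docs (text, embedding) VALUES (?, ?)",
                (text, pack(embedding)),
            )

    def search(self, query, k=3):
        """Exact top-k by cosine similarity (linear scan, no ANN index)."""
        rows = self.db.execute("SELECT text, embedding FROM docs")
        scored = [(cosine(query, unpack(blob)), text) for text, blob in rows]
        return sorted(scored, reverse=True)[:k]


if __name__ == "__main__":
    store = TinyStore()
    store.add("cats", [1.0, 0.0])
    store.add("dogs", [0.9, 0.1])
    store.add("cars", [0.0, 1.0])
    for score, text in store.search([1.0, 0.0], k=2):
        print(f"{score:.3f}  {text}")
```

Replacing the linear scan in `search` with an approximate index is where a library such as hnswlib would come in; the SQLite table would then hold the text and serve as the durable copy of the vectors.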

  • Victor

    What's our vector, Victor? Victor is a toy vector database written in Go. (by corlinp)

  • Also plugging my crappy vector database, which you probably shouldn't use for anything but a fun project. However, it can be set up and used in seconds. https://github.com/corlinp/Victor

  • lucene

    Apache Lucene open-source search software

  • Are they forking Lucene or somehow getting the Lucene devs to increase that limit? Because this issue has been open for over a year now: https://github.com/apache/lucene/issues/11507

  • Elasticsearch

    Free and Open, Distributed, RESTful Search Engine

  • No - they just did something in Elasticsearch to make their own FieldType https://github.com/elastic/elasticsearch/pull/95257

  • chroma

    the AI-native open-source embedding database

  • Chroma runs on Windows, since I believe it's just a Python package: https://github.com/chroma-core/chroma

  • similarity-search-kit

    🔎 SimilaritySearchKit is a Swift package providing on-device text embeddings and semantic search functionality for iOS and macOS applications.

  • How are you guys thinking about the embedding generation side of things? It seems like that part has a generally hefty compute cost before it even gets into the index. I just open-sourced a Swift package to try to make that part as easy as possible; the example project exports directly to Pinecone. https://github.com/ZachNagengast/similarity-search-kit

  • cozo

    A transactional, relational-graph-vector database that uses Datalog for query. The hippocampus for AI!

  • If anyone wants to try a FOSS vector-relational-graph hybrid database for more complicated workloads than simple vector search, here it is: https://github.com/cozodb/cozo/

    About the integrated vector search: https://docs.cozodb.org/en/latest/releases/v0.6.html

    It also does duplicate detection (Minhash-LSH) and full-text search within the query language itself: https://docs.cozodb.org/en/latest/releases/v0.7.html

    HN discussion a few days ago: https://news.ycombinator.com/item?id=35641164

    Disclaimer: I wrote it.
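The MinHash part of the duplicate detection mentioned in the comment above can be illustrated with a small Python sketch. This is not Cozo's implementation or query syntax: the function names, the word-trigram shingling, and the choice of 64 salted BLAKE2b hashes are all assumptions made for illustration. It covers only the signature and similarity estimate; the LSH step, which buckets signatures so that only likely duplicates are compared, is left out.

```python
import hashlib


def shingles(text, n=3):
    """Set of overlapping word n-grams (the whole text if it is shorter than n)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
    b = minhash_signature(shingles("the quick brown fox jumps over the lazy cat"))
    c = minhash_signature(shingles("vector databases index embeddings for search"))
    print("near-duplicate:", estimate_jaccard(a, b))
    print("unrelated:     ", estimate_jaccard(a, c))
```

The estimate gets tighter as `num_hashes` grows, at the cost of longer signatures.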

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
