Do we think about vector dbs wrong?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • annoy

    Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
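
    A minimal sketch of annoy's basic workflow (build, save, memory-mapped load, top-N query); the dimensionality and random vectors are placeholders:

        import random
        from annoy import AnnoyIndex

        dim = 64                              # vector dimensionality (arbitrary for this sketch)
        index = AnnoyIndex(dim, "angular")    # "angular" is annoy's cosine-style metric

        for i in range(1000):                 # index 1,000 random vectors by integer id
            index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

        index.build(10)                       # 10 trees; more trees = better recall, bigger index
        index.save("items.ann")               # on-disk format that annoy memory-maps on load

        query_index = AnnoyIndex(dim, "angular")
        query_index.load("items.ann")         # mmap load, so startup is cheap
        print(query_index.get_nns_by_item(0, 10))   # top-10 approximate neighbours of item 0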

  • The focus on the top 10 in vector search is a product of wanting to prove value over keyword search. Keyword search is going to miss some conceptual matches. You can try to work around that with tokenization and complex queries with all variations but it's not easy.

    Vector search isn't all that new a concept. For example, the annoy library (https://github.com/spotify/annoy) has been around since 2014. It was one of the first open source approximate nearest neighbor libraries. Recommendations have always been a good use case for vector similarity.

    Recommendations are a natural extension of search, and transformer models made it possible to build vectors for natural language. To prove the worth of vector search over keyword search, the focus was always on showing how the top N matches include results not possible with keyword search.

    In 2023, there has been a shift towards acknowledging that keyword search also has value and that a combination of vector and keyword search (aka hybrid search) operates in the sweet spot. Once again, this is validated through the same benchmarks, which focus on the top 10.

    On top of all this, there is also the reality that the vector database space is very crowded and some want to use their performance benchmarks for marketing.

    Disclaimer: I am the author of txtai (https://github.com/neuml/txtai), an open source embeddings database
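
    Since the comment above centres on hybrid (vector + keyword) search, here is a generic sketch of reciprocal rank fusion, one common way to merge the two ranked lists; it is not tied to txtai or any particular engine, and the document ids are placeholders:

        def reciprocal_rank_fusion(keyword_ids, vector_ids, k=60):
            """Merge two ranked id lists; k=60 is the constant from the original RRF paper."""
            scores = {}
            for ranking in (keyword_ids, vector_ids):
                for rank, doc_id in enumerate(ranking):
                    scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
            return sorted(scores, key=scores.get, reverse=True)

        # hypothetical top results from a BM25 index and a vector index
        keyword_top = ["d3", "d1", "d7", "d9", "d2"]
        vector_top = ["d1", "d4", "d3", "d8", "d5"]
        print(reciprocal_rank_fusion(keyword_top, vector_top)[:10])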

  • txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
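
    A minimal sketch in the spirit of txtai's documented quick start; the model name and sample data are placeholders:

        from txtai.embeddings import Embeddings

        # any sentence-transformers model works here; this one is just an example
        embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})

        data = [
            "US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed",
            "Maine man wins $1M from $25 lottery ticket",
        ]

        # index (id, text, tags) tuples, then run a semantic query
        embeddings.index([(i, text, None) for i, text in enumerate(data)])
        print(embeddings.search("positive news", 1))   # returns (id, score) pairs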

  • ann-benchmarks

    Benchmarks of approximate nearest neighbor libraries in Python
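
    The "top 10" framing in the comments above roughly corresponds to the recall@10 metric such benchmarks report; a minimal sketch of how that number is computed (the id lists are placeholders):

        def recall_at_k(approx_ids, exact_ids, k=10):
            """Fraction of the true top-k neighbours that the ANN index actually returned."""
            return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

        # hypothetical results for one query: ANN index vs. brute-force ground truth
        approx = [4, 9, 1, 7, 3, 12, 8, 2, 30, 6]
        exact = [4, 1, 9, 3, 7, 2, 8, 11, 6, 5]
        print(recall_at_k(approx, exact))   # 0.8 -> the index found 8 of the true top 10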

  • Elasticsearch

    Free and Open, Distributed, RESTful Search Engine

  • I believe the 1024-dimension limit on dense_vector fields has been raised in recent versions of Elasticsearch:

    https://github.com/elastic/elasticsearch/issues/92458
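
    For context, the limit in question is the dims setting on a dense_vector field. A hedged sketch with the Python client, assuming an 8.x cluster and client where higher dimensionalities (e.g. 1536, a common embedding size) are accepted; the index name, endpoint and values are placeholders:

        from elasticsearch import Elasticsearch

        es = Elasticsearch("http://localhost:9200")   # placeholder endpoint

        # dense_vector with dims above the old 1024 ceiling
        es.indices.create(
            index="docs",
            mappings={
                "properties": {
                    "text": {"type": "text"},   # keyword/BM25 side of a hybrid setup
                    "embedding": {
                        "type": "dense_vector",
                        "dims": 1536,
                        "index": True,
                        "similarity": "cosine",
                    },
                }
            },
        )

        # approximate kNN query against the vector field
        results = es.search(
            index="docs",
            knn={"field": "embedding", "query_vector": [0.1] * 1536,
                 "k": 10, "num_candidates": 100},
        )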

  • elk

    A nimble Mastodon web client

  • And there are tons of third-party clients. I think Tusky is the best one I’ve seen for Android, and there’s an interesting web-based one called Elk that’s very nice. You load up https://elk.zone and then use it as a front-end to sign in to your server.

  • vectara-answer

    LLM-powered Conversational AI experience using Vectara

  • I agree. My experience is that hybrid search does provide better results in many cases, and it is honestly not as easy to implement as it may seem at first. In general, getting search right can be complicated today, and the common thinking of "hey, I'm going to put up a vector DB and use that" is simplistic.

    Disclaimer: I'm with Vectara (https://vectara.com), we provide an end-to-end platform for building GenAI products.

  • Weaviate

    Weaviate is an open-source vector database that stores both objects and vectors, allowing you to combine vector search with structured filtering while getting the fault tolerance and scalability of a cloud-native database.

  • Hey @rvrs, I work on Weaviate and we are working on several improvements to increase write throughput:

    1. gRPC. Using gRPC to write vectors has given a really nice performance boost. It is released in Weaviate core, but there is still some work to do on the clients. Feel free to get in contact if you would like to try it out.

    2. Parameter tuning. Lowering `efConstruction` can speed up imports.

    3. We are also working on async indexing (https://github.com/weaviate/weaviate/issues/3463), which will further speed things up.

    In comparison with pgvector, Weaviate has more flexible query options such as hybrid search and quantization to save memory on larger datasets.
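
    To make those knobs concrete, a hedged sketch against the v3 Python client; the class name, endpoint and parameter values are illustrative, and the gRPC path and async indexing mentioned above are not shown:

        import weaviate

        client = weaviate.Client("http://localhost:8080")   # placeholder endpoint

        # class with an explicitly tuned HNSW index; lowering efConstruction
        # trades some index quality for faster imports, as noted above
        client.schema.create_class({
            "class": "Document",
            "vectorizer": "none",            # bring your own vectors
            "vectorIndexConfig": {
                "efConstruction": 64,        # default is 128; lower = faster writes
                "maxConnections": 32,
            },
        })

        # hybrid query: alpha=0.5 weighs the BM25 and vector scores equally
        response = (
            client.query
            .get("Document", ["title"])
            .with_hybrid(query="vector databases", alpha=0.5)
            .with_limit(10)
            .do()
        )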
