Vald: A Highly Scalable Distributed Vector Search Engine

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Weaviate

    Weaviate is an open-source vector database that stores both objects and vectors, combining vector search and structured filtering with the fault tolerance and scalability of a cloud-native database.

    You might be interested in checking out Weaviate (https://github.com/semi-technologies/weaviate), which supports combining vector search with boolean filters. In Weaviate, the vector index is stored alongside an inverted index, so you get the best of both worlds. At the moment, on a filtered search, the inverted index is hit first, so you essentially end up with an allow-list of IDs which is then fed to the vector index. If an ID isn't on the allow-list, the next neighbor is selected, and so on.

    There are also plans to further optimize this based on how restrictive the boolean filter is. For example, if the filter still matches a lot of IDs (e.g. 50% of all candidates), the approach outlined above works really well. But if your filter is extremely restrictive and maybe only matches 1k out of 100M results, then this approach is no longer ideal, as the vector index will discard most of the IDs it sees. In that case it would be more efficient to perform a brute-force search on just those 1k vectors. This and other optimizations around combined vector and scalar search are on the roadmap.
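
The two strategies described above can be sketched in a few lines of Python. This is only an illustration, not Weaviate's implementation: the function name `filtered_search`, the `brute_force_threshold` parameter, and the in-memory dictionaries are hypothetical, and an exact distance sort stands in for the real approximate vector index.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_search(vectors, attributes, query, predicate, k,
                    brute_force_threshold=1000):
    """Return the ids of the k nearest vectors whose attributes pass the filter.

    vectors:    dict mapping id -> list of floats
    attributes: dict mapping id -> dict of scalar properties
    predicate:  function(attrs) -> bool, the boolean filter
    brute_force_threshold: hypothetical cut-off, not a real Weaviate setting
    """
    # Stand-in for the inverted index: collect the allow-list of matching ids.
    allow_list = {i for i, attrs in attributes.items() if predicate(attrs)}

    if len(allow_list) <= brute_force_threshold:
        # Very restrictive filter: compare the query only against allowed ids.
        ranked = sorted(allow_list, key=lambda i: (l2(vectors[i], query), i))
        return ranked[:k]

    # Permissive filter: walk neighbors in distance order and skip any id
    # that is not on the allow-list. A real engine would traverse an
    # approximate index here instead of sorting every vector.
    results = []
    for i in sorted(vectors, key=lambda i: (l2(vectors[i], query), i)):
        if i in allow_list:
            results.append(i)
            if len(results) == k:
                break
    return results


vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0]}
attributes = {0: {"color": "red"}, 1: {"color": "blue"},
              2: {"color": "red"}, 3: {"color": "blue"}}
is_blue = lambda attrs: attrs["color"] == "blue"

# Force each code path by moving the threshold.
via_index_walk = filtered_search(vectors, attributes, [0.0, 0.0], is_blue, 2,
                                 brute_force_threshold=0)
via_brute_force = filtered_search(vectors, attributes, [0.0, 0.0], is_blue, 2)
```

Both paths return the same ids in this toy setup; the difference is in how many distance computations each one needs as the filter becomes more or less restrictive.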

  • vald

    Vald. A Highly Scalable Distributed Vector Search Engine

  • annoy

    Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

    With only 100k vectors, why didn’t you just use Annoy? It’s been around for many years and it is fast, reliable, super easy to use and handles pretty large data quite well too.

    https://github.com/spotify/annoy

  • alvd

    alvd = A Lightweight Vald. A lightweight distributed vector search engine that works without K8s.

    This might be of interest to some - Vald without the Kubernetes requirement (or integration): https://github.com/rinx/alvd

  • faiss

    A library for efficient similarity search and clustering of dense vectors.

    Dense vectors are pervasive in neural networks. Take a layer's activations and you have a new one (which is what NN embeddings are).

    Indexing them is really hard: the volume of the local neighborhood to be searched grows as radius^dimension, so you need a specialized engine.
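
To put rough numbers on the radius^dimension growth mentioned above, here is a tiny sketch. The helper name `volume_growth` is made up for illustration; it only computes how much the neighborhood volume grows when the search radius is scaled by a given factor.

```python
def volume_growth(radius_factor, dimension):
    """Factor by which a d-dimensional ball's volume grows when its
    radius is multiplied by radius_factor (volume scales as r**d)."""
    return radius_factor ** dimension


# Doubling the search radius in a few example dimensions.
for d in (2, 3, 10, 128):
    print(f"dimension {d}: volume grows by a factor of {volume_growth(2, d)}")
```

In 3 dimensions, doubling the radius means 8 times the volume to search; in 10 dimensions it is already 1024 times, which is why grid-like partitioning stops being practical for embeddings with hundreds of dimensions.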

    Now, say you are a three-letter agency and have Facebook's image data (2B faces, 100B images). You train a deep learning image classification network to identify faces. But for training time reasons you can only do it on a dataset of about 10k different faces, 10M images. The NN will identify characteristic visual features to make its classification, available at the n-1 layer as activations.

    But you need to have the rest of the dataset searchable! You now index the activations of all the 100B remaining images. And with this you can search by similarity of visual features. If you have an image of someone you want to track, get its activations and search for other images that have similar vectors.

    This works for everything that a NN can model: speech, words, words in sentences, videos, etc.
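
The search step described above can be sketched as an exact, brute-force lookup over stored activation vectors. This is only a toy illustration: the names `cosine_similarity` and `most_similar` and the tiny two-dimensional vectors are assumptions, cosine similarity is one possible choice of similarity measure, and an engine such as the ones on this page would replace the full scan with an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(index, query, k):
    """Return the ids of the k stored vectors most similar to the query.

    index: dict mapping an image id to its activation vector
    """
    ranked = sorted(index.items(),
                    key=lambda item: cosine_similarity(item[1], query),
                    reverse=True)
    return [image_id for image_id, _ in ranked[:k]]


# Made-up activations for three images; real ones would have hundreds
# or thousands of dimensions.
index = {
    "img_a": [1.0, 0.0],
    "img_b": [0.0, 1.0],
    "img_c": [0.9, 0.1],
}
matches = most_similar(index, [1.0, 0.0], 2)
```

The full scan costs one similarity computation per stored vector, which is fine for a handful of images and hopeless at 100B, hence the need for a dedicated index.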

    ----

    On another note, Facebook has a feature-rich, GPU-powered, scalable indexing engine called FAISS:

    https://github.com/facebookresearch/faiss

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
