Approximate Nearest Neighbors Oh Yeah

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • faiss

    A library for efficient similarity search and clustering of dense vectors.

  • If you want to experiment with vector stores, you can do that locally with something like faiss which has good platform support: https://github.com/facebookresearch/faiss

    Doing full retrieval-augmented generation (RAG) and getting LLMs to interpret the results takes more steps, but you get a lot of flexibility, and there's no standard best practice. When you query a vector DB you get the most similar texts back (or integer indices in the case of faiss); you then feed those texts to an LLM as part of a normal prompt.

    The codifier of the RAG workflow is LangChain, but their demo is substantially more complex and harder to use than even a homegrown implementation: https://news.ycombinator.com/item?id=36725982
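
A minimal sketch of the retrieve-then-prompt step described above, using only the Python standard library. It is a brute-force stand-in, not faiss: the `search` function, the hand-made 3-dimensional "embeddings", and the prompt wording are all assumptions for illustration. What it shares with faiss is the shape of the result, a list of integer ids that you map back to your own texts.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index_vectors, query, k):
    """Brute-force search: return the integer ids of the k most similar vectors."""
    ranked = sorted(
        range(len(index_vectors)),
        key=lambda i: cosine(index_vectors[i], query),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, texts, ids):
    """Map ids back to texts and paste them into an ordinary prompt string."""
    context = "\n".join(f"- {texts[i]}" for i in ids)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy data: in practice the vectors would come from an embedding model.
texts = [
    "Annoy builds a forest of trees.",
    "faiss returns integer ids.",
    "Bananas are yellow.",
]
vectors = [[1.0, 0.1, 0.0], [0.9, 0.3, 0.0], [0.0, 0.0, 1.0]]
query_vector = [1.0, 0.2, 0.0]

ids = search(vectors, query_vector, k=2)
prompt = build_prompt("How do ANN libraries return results?", texts, ids)
print(prompt)
```

The resulting `prompt` string is what you would send to whichever LLM you use; that call is left out here because it depends entirely on the provider.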

  • txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

  • If you want an easy way to evaluate Faiss, Hnswlib and Annoy vector backends, check out txtai - https://github.com/neuml/txtai. txtai also supports NumPy and PyTorch vector storage.

    Disclaimer: I am the author of txtai

  • voyager

    🛰️ An approximate nearest-neighbor search library for Python and Java with a focus on ease of use, simplicity, and deployability. (by spotify)

  • Annoy came out of Spotify, and they just announced their successor library Voyager [1] last week [2].

    [1]: https://github.com/spotify/voyager

  • ann-benchmarks

    Benchmarks of approximate nearest neighbor libraries in Python

  • https://ann-benchmarks.com/ is a good resource covering those libraries and much more.

  • np-sims

    numpy ufuncs for vector similarity

  • I implemented this recently in C as a numpy extension[1], for fun. Even had a vectorized solution going.

    You'll get diminishing returns on recall pretty fast. There's actually a theorem that tracks this - the Johnson-Lindenstrauss lemma[2], if you're interested.

    As I mention in a talk I gave[3], it can work if you're going to rerank anyway and the vector search isn't the main ranking signal. It's also easy to update, as the hashes are non-parametric (they don't depend on the data).

    The lack of data dependency, however, is the main problem. Vector spaces are lumpy. You can see this in the distribution of human beings on the surface of the earth: postal codes and area codes vary from small to huge. Random hashes, like a grid, wouldn't let you accurately map out the distribution of all the people or clump them close to their actual nearest neighbors. Manhattan is not rural Alaska.

    Annoy actually builds on these hashes by creating trees of such hashes, finding a split into left and right at each node. Then it creates a forest of such trees. So it's essentially a forest of random hash trees with data dependency.

    Hope that helps.

    1 - https://github.com/softwaredoug/np-sims
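
To make the "hashes don't depend on the data" point concrete, here is a small sketch of random-hyperplane hashing in plain Python. It assumes the hashes under discussion are sign-of-random-projection hashes; it is not the np-sims implementation, and the function names, bit count, and seeds are made up for illustration.

```python
import random

def random_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplanes; no data is consulted."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_vector(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def hamming(a, b):
    """Number of differing bits; a rough proxy for the angle between vectors."""
    return sum(x != y for x, y in zip(a, b))

planes = random_hyperplanes(n_bits=64, dim=8, seed=42)

rng = random.Random(1)
base = [rng.gauss(0.0, 1.0) for _ in range(8)]
near = [x + 0.05 * rng.gauss(0.0, 1.0) for x in base]  # small perturbation
far = [-x for x in base]                               # opposite direction

h_base = hash_vector(base, planes)
h_near = hash_vector(near, planes)
h_far = hash_vector(far, planes)

print("bits differing from near vector:", hamming(h_base, h_near))
print("bits differing from far vector: ", hamming(h_base, h_far))
```

Because the planes are drawn once and never look at the data, adding a new vector only means hashing it, which is why updates are easy. The same property is the weakness described above: the planes cut dense and sparse regions of the space equally often.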
