Approximate Nearest Neighbors Oh Yeah

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • faiss

    A library for efficient similarity search and clustering of dense vectors.

    If you want to experiment with vector stores, you can do that locally with something like faiss, which has good platform support: https://github.com/facebookresearch/faiss

    Doing full retrieval-augmented generation (RAG) and getting LLMs to interpret the results takes more steps, but you get a lot of flexibility, and there's no standard best practice. When you use a vector DB, you get the most similar texts back (or an integer index in the case of faiss), and you then feed those to an LLM like a normal prompt.

    The codifier of the RAG workflow is LangChain, but its demo is substantially more complex and harder to use than even a homegrown implementation: https://news.ycombinator.com/item?id=36725982
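
The retrieve-then-prompt flow described above can be sketched in a few lines. This is a minimal illustration, not faiss or LangChain code: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, `search` is an exact brute-force scan rather than an approximate index, and `DOCS`, `search`, and `build_prompt` are names made up for this example. Like a bare index, `search` returns integer positions that have to be mapped back to the stored texts.

```python
import math
from collections import Counter

# Toy corpus; in a real setup these would be your document chunks.
DOCS = [
    "Annoy builds a forest of random projection trees.",
    "Faiss is a library for similarity search over dense vectors.",
    "Bananas are rich in potassium.",
]

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, index, k=2):
    """Exact brute-force search; returns integer positions, not texts."""
    q = embed(query)
    ranked = sorted(range(len(index)), key=lambda i: cosine(q, index[i]), reverse=True)
    return ranked[:k]

def build_prompt(query, doc_ids):
    """Map the returned integers back to texts and assemble an ordinary prompt."""
    context = "\n".join(f"- {DOCS[i]}" for i in doc_ids)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

index = [embed(d) for d in DOCS]
question = "Which library does similarity search over dense vectors?"
top_ids = search(question, index, k=2)
prompt = build_prompt(question, top_ids)
print(top_ids)
print(prompt)
```

The resulting `prompt` string is what would be sent to an LLM. Swapping the brute-force scan for a real vector index should only change `search`, assuming the index also returns positions into `DOCS`.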

  • txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

    If you want an easy way to evaluate Faiss, Hnswlib and Annoy vector backends, check out txtai - https://github.com/neuml/txtai. txtai also supports NumPy and PyTorch vector storage.

    Disclaimer: I am the author of txtai

  • voyager

    🛰️ An approximate nearest-neighbor search library for Python and Java with a focus on ease of use, simplicity, and deployability. (by spotify)

    Annoy came out of Spotify, and they just announced their successor library Voyager [1] last week [2].

    [1]: https://github.com/spotify/voyager

  • ann-benchmarks

    Benchmarks of approximate nearest neighbor libraries in Python

    https://ann-benchmarks.com/ is a good resource covering those libraries and much more.

  • np-sims

    numpy ufuncs for vector similarity

    I implemented this recently in C as a numpy extension[1], for fun. Even had a vectorized solution going.

    You'll get diminishing returns on recall pretty fast. There's actually a theorem that tracks this, the Johnson-Lindenstrauss lemma[2], if you're interested.

    As I mention in a talk I gave[3], it can work if you're going to rerank anyway and the vector search isn't the main ranking signal. It's also easy to update, as the hashes are non-parametric (they don't depend on the data).

    The lack of data dependency, however, is the main problem. Vector spaces are lumpy. You can see this in the distribution of human beings on the surface of the earth: postal codes and area codes vary from small to huge. Random hashes, like a grid, wouldn't let you accurately map out the distribution of all the people or clump them close to their actual nearest neighbors. Manhattan is not rural Alaska.

    Annoy actually builds on these hashes by creating trees of such hashes, finding a split between left and right at each step. Then it creates a forest of such trees. So it's essentially a forest of random hash trees with data dependency.

    Hope that helps.

    1 - https://github.com/softwaredoug/np-sims
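
To make the idea of data-independent hashes concrete, here is a minimal sketch of one common variant, random hyperplane hashing. It is an assumption that this resembles the approach in the comment above; it is not taken from np-sims, and all function names are illustrative. Each bit records which side of a random hyperplane a vector falls on, and the fraction of differing bits between two hashes estimates the angle between the vectors.

```python
import math
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw random Gaussian hyperplanes; they never look at the data."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_vector(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return [1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0 for plane in planes]

def estimated_cosine(bits_a, bits_b):
    """The fraction of differing bits estimates angle / pi; convert back to a cosine."""
    differing = sum(a != b for a, b in zip(bits_a, bits_b))
    return math.cos(math.pi * differing / len(bits_a))

def true_cosine(u, v):
    """Exact cosine similarity, for comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

rng = random.Random(42)
u = [rng.gauss(0.0, 1.0) for _ in range(32)]
v = [a + 0.5 * rng.gauss(0.0, 1.0) for a in u]  # a noisy copy of u, so fairly similar

exact = true_cosine(u, v)
for n_bits in (16, 64, 256, 1024):
    planes = make_hyperplanes(n_bits, dim=32)
    approx = estimated_cosine(hash_vector(u, planes), hash_vector(v, planes))
    print(f"{n_bits:5d} bits: estimated {approx:+.3f}  exact {exact:+.3f}")
```

The error of this estimate shrinks roughly with the square root of the number of bits, so each extra gain in accuracy costs disproportionately more bits, which is one way to see the diminishing returns mentioned above. Because the hyperplanes ignore the data, dense regions of the vector space get no finer resolution than sparse ones.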

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • Vector Search with OpenAI Embeddings: Lucene Is All You Need

    2 projects | news.ycombinator.com | 3 Sep 2023
  • Do we need a specialized vector database?

    2 projects | news.ycombinator.com | 12 Aug 2023
  • Comparison of Vector Databases

    7 projects | news.ycombinator.com | 31 Jul 2023
  • The Vector Database Index: Who, what, why now, & how

    3 projects | news.ycombinator.com | 20 Sep 2022
  • txtai: Vector search, Knowledge Graphs, RAG and LLM workflows locally run

    2 projects | news.ycombinator.com | 29 Jun 2024
