Approximate Nearest Neighbors Oh Yeah

faiss

70 28,202 9.4 C++

A library for efficient similarity search and clustering of dense vectors.

If you want to experiment with vector stores, you can do that locally with something like faiss, which has good platform support: https://github.com/facebookresearch/faiss
Doing full retrieval-augmented generation (RAG) and getting LLMs to interpret the results takes more steps, but you get a lot of flexibility, and there's no standard best practice. When you query a vector DB you get the most similar texts back (or integer indices, in the case of faiss), which you then feed to an LLM like a normal prompt.
LangChain is the codifier of the RAG workflow, but their demo is substantially more complex and harder to use than even a homegrown implementation: https://news.ycombinator.com/item?id=36725982
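To make the retrieval step above concrete, here is a minimal homegrown sketch in plain Python. It is not faiss or LangChain code: the corpus, the 3-d "embeddings", and the helper names (`search`, `build_prompt`) are all made up for illustration, and the search is exact brute force rather than an approximate index. It only shows the shape of the flow: the search returns integer indices, you map them back to texts, and the texts go into an ordinary prompt.

```python
import math

# Hypothetical toy corpus with hand-made 3-d "embeddings". A real setup
# would get vectors from an embedding model and keep them in an index.
texts = [
    "Annoy builds a forest of random projection trees.",
    "faiss returns integer indices of the nearest vectors.",
    "Bananas are rich in potassium.",
]
vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.1, 0.9],
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, k):
    """Exact brute-force search: indices of the k most similar vectors."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, hit_ids):
    """Map indices back to texts and paste them into a plain prompt."""
    context = "\n".join(f"- {texts[i]}" for i in hit_ids)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

query_vec = [1.0, 0.1, 0.0]  # hypothetical embedding of the question
hit_ids = search(query_vec, k=2)
prompt = build_prompt("How does vector search return results?", hit_ids)
print(hit_ids)
print(prompt)
```

The resulting `prompt` string is what you would send to whichever LLM you use; that call is left out here because it depends on the provider.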

txtai

355 6,990 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

If you want an easy way to evaluate Faiss, Hnswlib and Annoy vector backends, check out txtai - https://github.com/neuml/txtai. txtai also supports NumPy and PyTorch vector storage.
Disclaimer: I am the author of txtai

voyager

4 1,155 7.9 C++

🛰️ An approximate nearest-neighbor search library for Python and Java with a focus on ease of use, simplicity, and deployability. (by spotify)

Annoy came out of Spotify, and they just announced their successor library Voyager [1] last week [2].
[1]: https://github.com/spotify/voyager

ann-benchmarks

51 4,588 8.1 Python

Benchmarks of approximate nearest neighbor libraries in Python

https://ann-benchmarks.com/ is a good resource covering those libraries and much more.

np-sims

2 12 8.6 Python

numpy ufuncs for vector similarity

I implemented this recently in C as a numpy extension[1], for fun. Even had a vectorized solution going.
You'll get diminishing returns on recall pretty fast. There's actually a theorem that tracks this, the Johnson-Lindenstrauss lemma[2], if you're interested.
As I mention in a talk I gave[3], it can work if you're going to rerank anyway and vector search isn't the main ranking signal. It's also easy to update, as the hashes are non-parametric (they don't depend on the data).
The lack of data dependency, however, is the main problem. Vector spaces are lumpy. You can see this in the distribution of human beings on the surface of the earth: postal codes and area codes vary from small to huge. Random hashes, like a grid, wouldn't let you accurately map out the distribution of all the people or clump them close to their actual nearest neighbors. Manhattan is not rural Alaska.
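For readers who haven't seen this kind of hash, here is a rough sketch, assuming the hashes in question are sign-of-random-projection (random hyperplane) hashes; that reading is an assumption, and none of this is np-sims code. The function names are made up. Note that the hyperplanes are drawn before any data is seen, which is the non-parametric property described above.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals. No data is consulted."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_vector(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = 0
    for plane in hyperplanes:
        side = sum(p * x for p, x in zip(plane, vec)) >= 0
        bits = (bits << 1) | int(side)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(n_bits=16, dim=4)
a = [1.0, 0.2, 0.0, 0.1]
b = [2.0, 0.4, 0.0, 0.2]      # same direction as a, twice the length
c = [-1.0, -0.2, 0.0, -0.1]   # exactly opposite direction to a

hash_a = hash_vector(a, planes)
hash_b = hash_vector(b, planes)
hash_c = hash_vector(c, planes)
print(hamming(hash_a, hash_b), hamming(hash_a, hash_c))
```

Vectors pointing the same way share every bit and opposite vectors share none; everything in between lands at a Hamming distance that roughly tracks the angle. Because the planes ignore where the data actually clusters, dense regions get no finer resolution than empty ones.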
Annoy actually builds on these hashes: it creates a tree of such hashes, finding a split into left and right at each node. Then it creates a forest of such trees. So it's essentially a forest of random hash trees with data dependency.
Hope that helps.
1 - https://github.com/softwaredoug/np-sims
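The forest-of-trees idea described in this comment can be sketched roughly as follows. This is a simplified illustration in plain Python, not Annoy's implementation: it assumes each node samples two points from its own subset and splits on the hyperplane equidistant from them (which is where the data dependency comes from), and it only descends one branch per tree at query time, whereas a real library explores more. All names and parameters here are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_tree(points, ids, rng, leaf_size=4):
    """Recursively split ids by which of two sampled points is closer."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    # The split depends on the data: both anchors come from this subset.
    p, q = (points[i] for i in rng.sample(ids, 2))
    left = [i for i in ids if sq_dist(points[i], p) <= sq_dist(points[i], q)]
    right = [i for i in ids if sq_dist(points[i], p) > sq_dist(points[i], q)]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return ("leaf", ids)
    return ("node", p, q,
            build_tree(points, left, rng, leaf_size),
            build_tree(points, right, rng, leaf_size))

def candidates(tree, query):
    """Follow the query down one tree and return the ids in its leaf."""
    while tree[0] == "node":
        _, p, q, left, right = tree
        tree = left if sq_dist(query, p) <= sq_dist(query, q) else right
    return tree[1]

def query_forest(forest, points, query, k):
    """Union the leaf candidates from every tree, then rank them exactly."""
    pool = set()
    for tree in forest:
        pool.update(candidates(tree, query))
    return sorted(pool, key=lambda i: sq_dist(points[i], query))[:k]

rng = random.Random(0)
points = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
all_ids = list(range(len(points)))
forest = [build_tree(points, all_ids, rng) for _ in range(10)]
result = query_forest(forest, points, points[0], k=3)
print(result)
```

More trees widen the candidate pool and raise recall at the cost of memory and query time, which is the usual trade-off with this family of indexes.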

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Vector Search with OpenAI Embeddings: Lucene Is All You Need
2 projects | news.ycombinator.com | 3 Sep 2023
Do we need a specialized vector database?
2 projects | news.ycombinator.com | 12 Aug 2023
Comparison of Vector Databases
7 projects | news.ycombinator.com | 31 Jul 2023
The Vector Database Index: Who, what, why now, & how
3 projects | news.ycombinator.com | 20 Sep 2022
What contributing to Open-source is, and what it isn't
1 project | news.ycombinator.com | 27 Apr 2024

Approximate Nearest Neighbors Oh Yeah

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Python, Machine Learning, nearest-neighbors, Hnsw, Search
Post date: 30 Oct 2023
