Faiss: A library for efficient similarity search

faiss

70 28,054 9.4 C++

A library for efficient similarity search and clustering of dense vectors.

There is a work-in-progress pull request [0] integrating RAFT [1] into Faiss as an implementation of CUDA GPU-backed indices, although you can also use RAFT directly.
[0] https://github.com/facebookresearch/faiss/pull/2521

hnsqlite

6 143 5.5 Python

hnsqlite integrates hnswlib and sqlite for simple text embedding search

hnswlib (https://github.com/nmslib/hnswlib) is a strong alternative to faiss that I have enjoyed using for multiple projects. It is simple and has great performance on CPU.
After working through several projects that utilized local hnswlib and different databases for text and vector persistence, I integrated hnswlib with sqlite to create an embedded vector search engine that can easily scale up to millions of embeddings. For self-hosted situations of under 10M embeddings and less than insane throughput I think this combo is hard to beat.
https://github.com/jiggy-ai/hnsqlite

ann-benchmarks

51 4,588 8.1 Python

Benchmarks of approximate nearest neighbor libraries in Python

Check out https://github.com/erikbern/ann-benchmarks for benchmark comparisons of different libraries. I'd be interested in hearing others' experiences using these libraries in production.

raft

3 608 9.6 Cuda

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications. (by rapidsai)

Milvus

104 26,857 10.0 Go

A cloud-native vector database, storage for next generation AI applications

Faiss is a wonderful vector search library. In particular, the ability to build hybrid indexes (e.g. IVF-PQ, IVF-SQ) is great. We (https://milvus.io) use it as one of the indexing options (along with Annoy, Nmslib, and DiskANN) to power our vector database.
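
To make the "IVF" half of those hybrid indexes concrete, here is a minimal pure-Python sketch of inverted-file partitioning. It is not Faiss code and it skips the PQ/SQ compression step entirely. The names `TinyIVF`, `kmeans`, and `l2` are hypothetical helpers made up for this illustration; `nprobe` is borrowed from Faiss terminology for the number of cells visited per query.

```python
import random

def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10, seed=0):
    """Tiny k-means used as the coarse quantizer (illustrative only)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: l2(v, centroids[c]))
            buckets[nearest].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

class TinyIVF:
    """Hypothetical toy index: one inverted list of vectors per centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]

    def add(self, vec_id, vec):
        # Store the vector in the list of its nearest centroid.
        cell = min(range(len(self.centroids)), key=lambda c: l2(vec, self.centroids[c]))
        self.lists[cell].append((vec_id, vec))

    def search(self, query, k=1, nprobe=1):
        # Visit only the nprobe cells whose centroids are closest to the query.
        order = sorted(range(len(self.centroids)), key=lambda c: l2(query, self.centroids[c]))
        candidates = [item for c in order[:nprobe] for item in self.lists[c]]
        candidates.sort(key=lambda item: l2(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:k]]

# Made-up 2-D data, purely for demonstration.
rng = random.Random(42)
data = [[rng.random(), rng.random()] for _ in range(200)]
index = TinyIVF(kmeans(data, k=8))
for i, v in enumerate(data):
    index.add(i, v)
print(index.search(data[17], k=3, nprobe=2))
```

Raising `nprobe` trades speed for recall: with `nprobe` equal to the number of cells, the search degenerates to an exact brute-force scan.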

txtai

354 6,953 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

txtai combines Faiss and SQLite to support similarity search with SQL.
For example: SELECT id, text, date FROM txtai WHERE similar('machine learning') AND date >= '2023-03-30'
GitHub: https://github.com/neuml/txtai
This article is a deep dive on how the index format works: https://neuml.hashnode.dev/anatomy-of-a-txtai-index
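
As a rough illustration of the idea of combining similarity search with SQL filters, the sketch below scores documents with a toy similarity function and then applies a date filter in SQLite. It is not txtai's API or implementation: `embed`, `cosine`, and `search` are hypothetical helpers, the bag-of-words "embedding" stands in for a real model, and the brute-force scoring stands in for a Faiss index.

```python
import math
import sqlite3
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (a stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Made-up example rows.
docs = [
    (0, "machine learning for search", "2023-04-02"),
    (1, "machine learning basics", "2023-01-15"),
    (2, "bouldering competition results", "2023-04-10"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, date TEXT)")
db.executemany("INSERT INTO docs VALUES (?, ?, ?)", docs)
vectors = {doc_id: embed(text) for doc_id, text, _ in docs}

def search(query, min_date, limit=3):
    # Step 1: vector side. Score every document (a real system would use an ANN index).
    q = embed(query)
    scores = {doc_id: cosine(q, vec) for doc_id, vec in vectors.items()}
    hits = [doc_id for doc_id, s in scores.items() if s > 0]
    if not hits:
        return []
    # Step 2: SQL side. Apply the structured filter to the candidate ids.
    marks = ",".join("?" * len(hits))
    rows = db.execute(
        f"SELECT id, text, date FROM docs WHERE id IN ({marks}) AND date >= ?",
        (*hits, min_date),
    ).fetchall()
    rows.sort(key=lambda r: scores[r[0]], reverse=True)
    return rows[:limit]

results = search("machine learning", "2023-03-30")
print(results)
```

Only the first row survives here: the second row is similar to the query but fails the date filter, and the third passes the date filter but has no similarity to the query.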

marqo

114 4,111 9.3 Python

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

A big difficulty in using vector DBs in production for things like embeddings or LLMs is that there is a lot that goes into converting and processing raw input into vector form (think chunking, formatting, encoding, inference, metadata, etc). DBs like Pinecone just don't handle any of that, so you have to build out large systems to do it yourself.
There are some platforms and open source tools that handle it end to end. https://github.com/marqo-ai/marqo is one example that is both open source and has a cloud offering.
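
To give a sense of the preprocessing steps mentioned above, here is a small sketch of just the chunking, formatting, and metadata part. It is not Marqo's or Pinecone's API; `chunk_words` and `prepare` are hypothetical helpers, the window sizes are arbitrary, and the encoding/inference step is deliberately left out.

```python
def chunk_words(text, size=5, overlap=2):
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def prepare(doc_id, text, source):
    """Normalize whitespace, chunk, and attach metadata to each chunk.

    A real pipeline would also run each chunk through an encoder to get a vector.
    """
    cleaned = " ".join(text.split())
    return [
        {"id": f"{doc_id}-{i}", "text": chunk, "source": source}
        for i, chunk in enumerate(chunk_words(cleaned))
    ]

# Made-up input, purely for demonstration.
records = prepare("doc1", "one two three  four five six seven eight", "example.txt")
for record in records:
    print(record)
```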

faiss-rs

2 185 6.5 Rust

Rust language bindings for Faiss

I know Rust has bindings to FAISS (see https://github.com/Enet4/faiss-rs), but I don't know if there's anything that would be considered comparable. A lot of work has gone into FAISS.

qdrant

140 17,839 9.9 Rust

Qdrant - High-performance, massive-scale Vector Database for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/

I forgot about https://github.com/qdrant/qdrant. It's a DB, not a library, so again it may not be an exact answer for what you're looking for.

hora

9 2,554 0.0 Rust

🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀 .

Maybe https://github.com/hora-search/hora, but I've never used it.

hnswlib

12 4,000 6.6 C++

Header-only C++/python library for fast approximate nearest neighbors

annoy

40 12,692 5.3 C++

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

I like Faiss but I tried Spotify's annoy[1] for a recent project and was pretty impressed.
Since lots of people don't seem to understand how useful these embedding libraries are, here's an example. I built a thing that indexes bouldering and climbing competition videos, then builds an embedding of the climber's body position per frame. I can then automatically match different climbers on the same problem.
It works pretty well. Since the body positions are 3D it works reasonably well across camera angles.
The biggest problem is getting the embedding right. I simplified it a lot above because I actually need to embed the problem shape itself because otherwise it matches too well: you get frames of people in identical positions but on different problems!
[1] https://github.com/spotify/annoy
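
One possible way to picture the "matches too well" issue is to compare a pose-only lookup with a lookup over a pose vector concatenated with a problem vector. This is only a sketch under assumptions: the comment does not say how the problem shape was actually embedded, the numbers are made up, and `combined` and `l2` are hypothetical helpers using brute-force search rather than Annoy.

```python
def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def combined(pose, problem, weight=1.0):
    """Concatenate pose and problem vectors; weight scales the problem part."""
    return list(pose) + [weight * x for x in problem]

# Made-up frames: (pose vector, problem vector).
frames = {
    "climber_b_problem_2": ([0.1, 0.9, 0.4], [0.0, 1.0]),  # identical pose, different problem
    "climber_c_problem_1": ([0.2, 0.8, 0.4], [1.0, 0.0]),  # similar pose, same problem
}
query_pose, query_problem = [0.1, 0.9, 0.4], [1.0, 0.0]

# Pose-only search picks the identical pose, even though it is on another problem.
pose_only = min(frames, key=lambda name: l2(query_pose, frames[name][0]))

# Adding the problem vector pulls the match back to the same problem.
with_problem = min(
    frames,
    key=lambda name: l2(combined(query_pose, query_problem), combined(*frames[name])),
)
print(pose_only, with_problem)
```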

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

[P] I trained an AI model on 120M+ songs from iTunes
2 projects | /r/MachineLearning | 4 Feb 2023
Show HN: Danswer – open-source question answering across all your docs
7 projects | news.ycombinator.com | 10 Jul 2023
I've changed my mind about Code Interpretor
3 projects | /r/ChatGPT | 9 Jul 2023
FANN: Vector Search in 200 Lines of Rust
8 projects | news.ycombinator.com | 15 Jun 2023
Top 10 Best Vector Databases & Libraries
10 projects | dev.to | 19 Apr 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
approximate-nearest-neighbor-search similarity-search image-search vector-search Search
Post date: 30 Mar 2023