annoy
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
There is a technique called similarity hashing that finds sets of similar items in a large database in roughly O(1) time per lookup (building the index costs O(n)). You may find the Annoy library useful: https://github.com/spotify/annoy
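To make the O(n) indexing and roughly O(1) lookup concrete, here is a minimal sketch of one similarity-hashing scheme (random-hyperplane signatures stored in a dictionary). It is a toy illustration, not Annoy's implementation, and the names `make_hyperplanes`, `signature`, `build_index`, and `lookup` are made up for this example. It only returns items that land in exactly the same bucket, so a real system would typically use several hash tables to improve recall.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def build_index(items, hyperplanes):
    """Single pass over all items: O(n) hash computations."""
    index = defaultdict(list)
    for key, vector in items.items():
        index[signature(vector, hyperplanes)].append(key)
    return index


def lookup(index, query, hyperplanes):
    """One hash computation plus one dictionary access."""
    return index.get(signature(query, hyperplanes), [])


# Hypothetical data: two vectors pointing the same way, one pointing the opposite way.
items = {
    "a": [1.0, 0.0, 0.0],
    "a_scaled": [2.0, 0.0, 0.0],
    "b": [-1.0, 0.0, 0.0],
}
planes = make_hyperplanes(num_bits=8, dim=3)
index = build_index(items, planes)
print(lookup(index, [3.0, 0.0, 0.0], planes))
```

Vectors that point in the same direction always share a bucket here, and the more the angle between two vectors grows, the more likely at least one bit differs.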
There is actually a whole family of hash functions called locality-sensitive hashing (LSH) functions, which have the property that the probability of a hash collision increases with the similarity of the hashed values. I've used Simhash myself for textual similarity, but LSH can also be used to find similar images, audio, or other data types.
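For the textual case, a rough sketch of a Simhash-style fingerprint might look like the following. The assumptions are mine, not the commenter's: tokens are whitespace-separated lowercase words, every token has equal weight, and MD5 is used only as a convenient stable hash. Similar documents should tend to produce fingerprints with a small Hamming distance.

```python
import hashlib


def simhash(text, num_bits=64):
    """Return a num_bits-wide fingerprint of text (num_bits <= 64 here)."""
    counts = [0] * num_bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # take 64 bits of the token hash
        for i in range(num_bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical documents for illustration.
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
doc3 = "approximate nearest neighbor search with vector indexes"

print(hamming_distance(simhash(doc1), simhash(doc2)))
print(hamming_distance(simhash(doc1), simhash(doc3)))
```

Near-duplicates can then be found by comparing fingerprints with a Hamming-distance threshold; the threshold itself is something you would have to tune on your own data.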
Related posts
- Vector Databases 101
- I'm an undergraduate data science intern and trying to run kmodes clustering. Did this elbow method to figure out how many clusters to use, but I don't really see an "elbow". Tips on number of clusters?
- Calculating document similarity in a special domain
- Can Parquet file format index string columns?
- Billion-Scale Approximate Nearest Neighbor Search [pdf]