Seems the author is proposing LSH instead of vectors for doing ANN?
There are benchmarks at http://ann-benchmarks.com/, but LSH underperforms state-of-the-art ANN algorithms like HNSW on recall/throughput.
I believe LSH was state of the art 10ish years ago, but it has since been surpassed. The caching aspect is really nice, though.
Hashes are great, but to say that "vectors are over" is just plain nonsense. We continue to see vectors as a core part of production systems for entity representation and recommendation (example: https://slack.engineering/recommend-api) and within models themselves (example: multimodal and diffusion models). For folks into metrics, we're building a vector database specifically for storing, indexing, and searching across massive quantities of vectors (https://github.com/milvus-io/milvus), and we've seen close to exponential growth in terms of total downloads.
Vectors are just getting started.
https://github.com/serega/gaoya
An HNSW index is slow to construct, so it is best suited to search or recommendation engines where you build the index and then serve queries. For workloads where you continuously mutate the index, like streaming clustering/deduplication, LSH outperforms HNSW.
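To make the "continuously mutate the index" point concrete, here is a rough sketch of a MinHash LSH index in plain Python where inserts and removals are just hash-bucket updates. This is not gaoya's API; `MinHashLSHIndex`, `minhash`, and the `num_perm`/`bands` parameters are hypothetical names picked for illustration, and a real implementation would be considerably faster.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, token: str) -> int:
    # Deterministic 64-bit hash of a token under a given seed.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens, num_perm=64):
    """MinHash signature: the minimum hash value per seed.

    Assumes `tokens` is non-empty; min() raises ValueError otherwise.
    """
    tokens = set(tokens)
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_perm))


class MinHashLSHIndex:
    """Hypothetical banded LSH index; insert/remove only touch hash buckets."""

    def __init__(self, num_perm=64, bands=16):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        # One bucket table per band: band key -> set of document ids.
        self.buckets = [defaultdict(set) for _ in range(bands)]
        self.signatures = {}

    def _band_keys(self, sig):
        for b in range(self.bands):
            yield b, sig[b * self.rows:(b + 1) * self.rows]

    def insert(self, doc_id, tokens):
        sig = minhash(tokens, self.num_perm)
        self.signatures[doc_id] = sig
        for b, key in self._band_keys(sig):
            self.buckets[b][key].add(doc_id)

    def remove(self, doc_id):
        # Removal needs the stored signature to find the right buckets.
        sig = self.signatures.pop(doc_id)
        for b, key in self._band_keys(sig):
            bucket = self.buckets[b][key]
            bucket.discard(doc_id)
            if not bucket:
                del self.buckets[b][key]

    def query(self, tokens):
        # Candidates are documents sharing at least one full band.
        sig = minhash(tokens, self.num_perm)
        candidates = set()
        for b, key in self._band_keys(sig):
            candidates |= self.buckets[b].get(key, set())
        return candidates
```

An exact duplicate always lands in the same buckets, so `query` will return it; near-duplicates are returned with a probability that depends on their Jaccard similarity and the choice of `bands`. Each insert or removal costs one signature computation plus `bands` dictionary updates, with no graph to rebuild.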