[D] Good algorithm for clustering big data (sentences represented as embeddings)?

faiss

70 27,924 9.4 C++

A library for efficient similarity search and clustering of dense vectors.

It depends on the use case behind the clustering. If it is for retrieval based on similarity search, then the Faiss library is pretty good. It has several index implementations with different tradeoffs in speed, accuracy, memory use, and dataset size, and it has GPU support as well.

hdbscan

6 2,671 7.0 Jupyter Notebook

A high performance implementation of HDBSCAN clustering.

Maybe use (H)DBSCAN, which I think should also work for huge datasets. I don't think there is a ready-to-use clustering implementation with a built-in cosine similarity metric, and you also won't be able to precompute the dense 100k x 100k similarity matrix. The way around this is to L2-normalize your embeddings: the dot product then equals the cosine similarity, and the Euclidean distance between normalized vectors works as a proxy for cosine distance. See also https://github.com/scikit-learn-contrib/hdbscan/issues/69
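
To make the normalization trick concrete: for unit-length vectors, the squared Euclidean distance equals 2 - 2 * cos(a, b), so a clusterer that only supports Euclidean distance ranks pairs the same way cosine distance would. Below is a small dependency-free sketch that checks this identity on two made-up vectors (the helper names and the example values are illustrative assumptions; in practice you would do this with NumPy on the whole embedding matrix).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two made-up "embeddings" with different lengths.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
ua, ub = l2_normalize(a), l2_normalize(b)

cos_ab = cosine_similarity(a, b)
# After normalization the plain dot product is the cosine similarity ...
assert math.isclose(dot(ua, ub), cos_ab)
# ... and squared Euclidean distance is a monotonic function of it.
assert math.isclose(squared_euclidean(ua, ub), 2 - 2 * cos_ab)
```

So you can normalize once and then run a Euclidean-metric clusterer on the normalized vectors without ever building the pairwise similarity matrix.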

Top2Vec

13 2,833 7.0 Python

Top2Vec learns jointly embedded topic, document and word vectors.

Milvus

104 26,645 10.0 Go

A cloud-native vector database, storage for next generation AI applications

https://milvus.io/ could be a good alternative for you. Just run the server in Docker and use the REST APIs to add embeddings and query data.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.