Our great sponsors
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
Depends on what exactly is the use case behind this clustering. If it is for retrieval based on similarity search, then Faiss library is pretty good. It has several implementations of indexing structures with various tradeoffs wrt speed, accuracy, memory, dataset, etc, and has GPU support as well.
Maybe use (H)DBScan which I think should work also for huge datasets. I don't think there is a ready to use clustering with unbuild cosine similarily metrics, and you also won't be able to precompute the 100k X 100k dense similarity matrix. The only way to go on this is to L2 normalize your embeddings, then the dot product will be the angular distance as a proxy to the cosine similarily. See also https://github.com/scikit-learn-contrib/hdbscan/issues/69
https://milvus.io/ could be a good alternative for you. Just run the server in docker and make use of the rest apis to add embeddings and query data.