[D] Good algorithm for clustering big data (sentences represented as embeddings)?

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • faiss

    A library for efficient similarity search and clustering of dense vectors.

  • It depends on the exact use case behind the clustering. If it is for retrieval based on similarity search, then the Faiss library is pretty good. It has several index implementations with different tradeoffs in speed, accuracy, memory use, and dataset size, and it has GPU support as well.

  • hdbscan

    A high performance implementation of HDBSCAN clustering.

  • Maybe use (H)DBSCAN, which I think should also work for huge datasets. I don't think there is a ready-to-use clustering implementation with a built-in cosine similarity metric, and you also won't be able to precompute the dense 100k × 100k similarity matrix. The way around this is to L2-normalize your embeddings: the dot product of two normalized vectors is then their cosine similarity, and their Euclidean distance is a monotonic function of it (‖a − b‖² = 2 − 2·cos(a, b)), so Euclidean distance can stand in for cosine distance. See also https://github.com/scikit-learn-contrib/hdbscan/issues/69
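
The normalization trick in the comment above can be checked numerically. The sketch below uses only the Python standard library; the helper names and the two example vectors are made up for illustration, and the vectors are assumed to be nonzero. It shows that, after L2 normalization, the squared Euclidean distance equals 2 minus twice the cosine similarity, which is why a Euclidean-metric clusterer can be used on normalized embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a nonzero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two arbitrary example vectors (hypothetical stand-ins for embeddings).
a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([-2.0, 0.5, 4.0])

cosine = dot(a, b)                  # equals cosine similarity after normalization
dist_sq = squared_euclidean(a, b)   # should equal 2 - 2 * cosine

print(cosine, dist_sq, 2 - 2 * cosine)
```

Since the mapping from cosine similarity to Euclidean distance is monotonic, nearest neighbors under one are nearest neighbors under the other.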

  • Top2Vec

    Top2Vec learns jointly embedded topic, document and word vectors.

  • Milvus

    A cloud-native vector database, storage for next generation AI applications

  • https://milvus.io/ could be a good alternative for you. Just run the server in Docker and use the REST APIs to add embeddings and query data.

