[D] Good algorithm for clustering big data (sentences represented as embeddings)?

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • faiss

    A library for efficient similarity search and clustering of dense vectors.

  • It depends on the use case behind the clustering. If it is retrieval based on similarity search, then the Faiss library is pretty good. It has several index implementations with different tradeoffs in speed, accuracy, memory, and dataset, and it has GPU support as well.

  • hdbscan

    A high performance implementation of HDBSCAN clustering.

  • Maybe use (H)DBSCAN, which I think should also work for huge datasets. I don't think there is a ready-to-use clustering implementation with a built-in cosine similarity metric, and you also won't be able to precompute the dense 100k x 100k similarity matrix. The only way to go is to L2-normalize your embeddings: the dot product of two normalized vectors is then their cosine similarity, and the Euclidean distance between them is a monotonic function of it, so it works as a proxy for cosine distance. See also https://github.com/scikit-learn-contrib/hdbscan/issues/69

  • Top2Vec

    Top2Vec learns jointly embedded topic, document and word vectors.

  • Milvus

    A cloud-native vector database, storage for next generation AI applications

  • https://milvus.io/ could be a good alternative for you. Just run the server in Docker and use the REST APIs to add embeddings and query data.
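
The L2-normalization suggestion in the hdbscan entry above can be checked with a few lines of plain Python. This is only a sketch: the helper names (`l2_normalize`, `dot`, `cosine_similarity`, `squared_euclidean`) and the two toy vectors are made up for illustration and do not come from the thread. It shows that, after normalization, the dot product equals the cosine similarity and the squared Euclidean distance equals 2 - 2 * cosine similarity, which is why a Euclidean-only clusterer can stand in for a cosine one.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed directly on the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two toy "embeddings" (hypothetical values, chosen for easy arithmetic).
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_hat, b_hat = l2_normalize(a), l2_normalize(b)
cos_ab = cosine_similarity(a, b)

# After normalization the dot product is the cosine similarity ...
dot_normalized = dot(a_hat, b_hat)
# ... and the squared Euclidean distance is 2 - 2 * cos(a, b).
sq_dist_normalized = squared_euclidean(a_hat, b_hat)

print(cos_ab, dot_normalized, sq_dist_normalized, 2 - 2 * cos_ab)
```

Because the squared distance is a decreasing function of the cosine similarity, nearest neighbours under Euclidean distance on the normalized vectors are the same as nearest neighbours under cosine similarity. In practice one would presumably normalize the embedding matrix once and pass it to the clusterer with its default Euclidean metric, for example `hdbscan.HDBSCAN(min_cluster_size=...)` followed by `fit_predict`, rather than building a pairwise similarity matrix.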

