Introduction to K-Means Clustering

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • hdbscan

    A high performance implementation of HDBSCAN clustering.

    Working in spatial data science, I rarely find applications where k-means is the best tool. The problem is that it is difficult to know in advance how many clusters to expect on a map. Is it 5, 500, or 10,000? Here HDBSCAN [1] shines because it clusters _and_ selects the most suitable number of clusters by deciding where to cut the single-linkage cluster tree.

    [1]: https://github.com/scikit-learn-contrib/hdbscan
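
    To illustrate the idea of cutting a single-linkage tree, here is a minimal pure-Python sketch. It is not HDBSCAN and does not use the hdbscan library: it cuts at one fixed distance (the hypothetical `max_gap` parameter), whereas HDBSCAN works on density-adjusted distances and chooses clusters by their stability. The function name and sample points are assumptions made for this example. The point it shows is that the number of clusters falls out of the data instead of being supplied up front as k.

```python
import math


def single_linkage_cut(points, max_gap):
    """Label points by cutting a single-linkage tree at distance max_gap.

    Two points share a cluster if a chain of hops, each no longer than
    max_gap, connects them. Illustrative O(n^2) sketch, not HDBSCAN.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points that lies within max_gap of each other.
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            if math.dist(points[a], points[b]) <= max_gap:
                parent[find(a)] = find(b)

    # Relabel the union-find roots as 0, 1, 2, ... in order of appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


# Made-up sample: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
labels = single_linkage_cut(points, max_gap=2.0)
n_clusters = len(set(labels))  # discovered from the data, not chosen
```

    Changing `max_gap` changes how many clusters appear, which is the choice HDBSCAN aims to make automatically.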

  • ckwrap

    Wrapper for Ckmeans.1d.dp.

    Note also that for one-dimensional data specifically, the k-means clustering problem can be solved to global optimality. An R package implements this with a C++ core [1], and there is also a Python wrapper [2].

    [1]: https://cran.r-project.org/package=Ckmeans.1d.dp

    [2]: https://github.com/djdt/ckwrap
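
    A rough sketch of why one dimension is special: once the values are sorted, every optimal cluster is a contiguous run, so dynamic programming over split points finds the global optimum. The code below is a simple O(k·n²) illustration under that assumption, not the algorithm or API of Ckmeans.1d.dp or ckwrap; the function name `kmeans_1d_optimal` is hypothetical.

```python
def kmeans_1d_optimal(values, k):
    """Return (labels, cost) for a globally optimal 1-D k-means partition.

    cost is the total within-cluster sum of squared deviations.
    Labels are numbered 0..k-1 from the smallest values to the largest.
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")
    order = sorted(range(n), key=lambda i: values[i])
    xs = [values[i] for i in order]

    # Prefix sums give the cost of any run xs[i:j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i, j):
        # Sum of squared deviations from the mean of xs[i:j], with j > i.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting xs[:j] into m contiguous runs.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = best[m - 1][i] + cost(i, j)
                if candidate < best[m][j]:
                    best[m][j] = candidate
                    split[m][j] = i

    # Walk the stored split points backwards to label every value.
    labels = [0] * n
    j = n
    for m in range(k, 0, -1):
        i = split[m][j]
        for p in range(i, j):
            labels[order[p]] = m - 1
        j = i
    return labels, best[k][n]


# Made-up sample data with two obvious groups.
labels, total_cost = kmeans_1d_optimal([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], 2)
```

    Unlike the usual iterative k-means, this has no random initialization, so the result is the same on every run.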

  • groupImg

    A script in python to organize your images by similarity.

    If anyone is interested, I have two projects that use k-means:

    https://github.com/victorqribeiro/groupImg

    https://github.com/victorqribeiro/budget

    Being one of the first ML algorithms that I learned, I spent some time finding use cases for it.

    If I'm not mistaken, I've also used it to classify deforestation in an exercise.

  • budget

    A simple budget app that predicts where the expenses are being made (by victorqribeiro)

  • word2vec

    Automatically exported from code.google.com/p/word2vec

    It is not necessarily the case.

    For example, word2vec uses k-means clustering with a cosine similarity measure [1]. It works very, very well. The caveat is that not many of the optimized variants of k-means will work with that "distance".

    [1] https://github.com/tmikolov/word2vec/blob/master/word2vec.c#...
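
    As a hedged illustration of k-means with cosine similarity (often called spherical k-means), here is a minimal pure-Python sketch. It is not a transcription of word2vec.c: the function names, the fixed iteration count, and the choice to seed with the first k points are all assumptions made to keep the example short and deterministic.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    return [x / norm for x in v]


def spherical_kmeans(vectors, k, iterations=10):
    """Cluster vectors by direction; returns (labels, unit centroids)."""
    points = [normalize(v) for v in vectors]
    # Assumption: seed with the first k points; real code would randomize.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment: the largest dot product is the largest cosine similarity.
        for idx, p in enumerate(points):
            sims = [sum(a * b for a, b in zip(p, c)) for c in centroids]
            labels[idx] = max(range(k), key=lambda c: sims[c])
        # Update: average the members, then project back to unit length.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue  # keep the old centroid for an empty cluster
            mean = [sum(col) / len(members) for col in zip(*members)]
            if any(mean):
                centroids[c] = normalize(mean)
    return labels, centroids


# Made-up 2-D data: vectors pointing roughly along x or roughly along y.
data = [[1, 0], [0, 1], [10, 1], [1, 10], [5, 0.2]]
labels, centroids = spherical_kmeans(data, k=2)
```

    Here `[1, 0]` and `[10, 1]` end up in the same cluster even though they are far apart in Euclidean terms, because only direction matters under cosine similarity.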
