Python Clustering

Open-source Python projects categorized as Clustering

Top 23 Python Clustering Projects

  • orange

    🍊 📊 💡 Orange: Interactive data analysis

    Project mention: Taxonomy Management? | /r/technicalwriting | 2023-12-05

    First is identifying the "similar" things in a corpus. The best way I know to do that, for non-programmer audiences, is the Orange Data Mining tool, which gives you a node-based text mining interface to perform statistical analysis on text. Hierarchical Clustering shows, very rapidly, how similar your "modules" are and which ones are most similar. There are many other techniques (semantic viewer, similarity hash, etc.) as well; the right one will depend on how your content is laid out.
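The idea in the post above, grouping text "modules" by similarity, can be sketched in a few lines of plain Python. This is not how Orange implements it; it is a minimal illustration that assumes word-set Jaccard similarity and single-linkage merging, and the `modules` sample data and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 = identical, 0.0 = disjoint)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agglomerate(docs: dict) -> list:
    """Single-linkage agglomerative clustering over named text snippets.

    Returns the merge history as (cluster_a, cluster_b, similarity) tuples,
    in the order the merges happen (most similar pair first).
    """
    tokens = {name: set(text.lower().split()) for name, text in docs.items()}
    clusters = [(name,) for name in docs]
    merges = []

    def linkage(pair) -> float:
        # Single linkage: similarity of the closest pair of members
        return max(jaccard(tokens[x], tokens[y]) for x in pair[0] for y in pair[1])

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=linkage)
        merges.append((best[0], best[1], linkage(best)))
        # Replace the two merged clusters with their union
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return merges


# Hypothetical sample corpus: three documentation "modules"
modules = {
    "install": "how to install the product on linux",
    "setup": "how to install the product on windows",
    "billing": "update your billing and payment details",
}

merges = agglomerate(modules)
for left, right, similarity in merges:
    print(left, "+", right, "similarity:", similarity)
```

The merge history is the information a dendrogram draws: the earliest merges are the most similar modules, and late merges with low similarity mark content that stands apart.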

  • dedupe

    🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

    Project mention: Using deep learning for Fuzzy Matching | /r/datascience | 2023-07-06

  • awesome-community-detection

    A curated list of community detection research papers with implementations.

  • uis-rnn

    This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.

  • minisom

    🔴 MiniSom is a minimalistic implementation of the Self Organizing Maps

  • Unsupervised-Classification

    SCAN: Learning to Classify Images without Labels, incl. SimCLR. [ECCV 2020]

  • mteb

    MTEB: Massive Text Embedding Benchmark

    Project mention: AI for AWS Documentation | 2023-07-06

    RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:

    - Chunking can interfere with context boundaries

    - Content vectors can differ vastly from question vectors; for this you have to use hypothetical embeddings (they generate artificial questions and store them)

    - Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical embedding questions, metadata)

    - RAG will fail miserably with requests like "summarize the whole document"

    - To my knowledge, OpenAI embeddings aren't performing well; use an embedding that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings.
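The advice above about storing several embeddings per chunk (the chunk text, hypothetical questions, and metadata) might look roughly like the sketch below. It is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `ChunkRecord`, `retrieve`, and the sample records are hypothetical names and data, not part of mteb or any RAG library.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class ChunkRecord:
    text: str                      # the chunk itself
    questions: list                # hypothetical questions this chunk answers
    metadata: dict = field(default_factory=dict)

    def vectors(self) -> list:
        # One vector for the chunk plus one per hypothetical question
        return [embed(self.text)] + [embed(q) for q in self.questions]


def retrieve(query: str, records: list) -> ChunkRecord:
    """Return the record whose best-matching vector is closest to the query."""
    q = embed(query)
    return max(records, key=lambda r: max(cosine(q, v) for v in r.vectors()))


# Made-up sample records for illustration only
records = [
    ChunkRecord(
        text="Buckets are private by default and access is granted through policies.",
        questions=["How do I make a bucket public?"],
        metadata={"source": "storage-guide"},
    ),
    ChunkRecord(
        text="Functions stop running after a configurable limit is reached.",
        questions=["How do I change the function timeout?"],
        metadata={"source": "functions-guide"},
    ),
]

hit = retrieve("how do I change the function timeout", records)
print(hit.metadata["source"])
```

Scoring each record by its best vector lets a question match a stored hypothetical question even when the chunk's own wording shares few terms with the query, which is the mismatch the second bullet describes.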


  • similarity

    TensorFlow Similarity is a python package focused on making similarity learning quick and easy.

  • PyPOTS

    A Python toolbox/library for reality-centric machine learning/deep learning on partially-observed time series with PyTorch, including SOTA models supporting tasks of imputation, classification, clustering, and forecasting on incomplete (irregularly-sampled) multivariate time series with NaN missing values/data.

    Project mention: Missing values in time series collected from the real world are common to see and very pesky. A new state-of-the-art and fast neural network called SAITS is proposed to impute missing data in partially-observed multivariate time series. The code is open source on GitHub. | /r/datascience | 2023-06-28

    Oh, wow, thanks for sharing it here! PyPOTS still has a long way to go, and I'm making it better. If you have any suggestions for PyPOTS, please let me know. Your feedback is always welcome and means a lot to the community of PyPOTS! If you like PyPOTS, please star 🌟 PyPOTS repo on GitHub and share it with people you know who may need it to help others notice this helpful work. Thank you very much!

  • Unsupervised-Semantic-Segmentation

    Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals. [ICCV 2021]

  • matrixprofile

    A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.

  • slot-attention

    Implementation of Slot Attention from GoogleAI

    Project mention: Object-Centric Learning with Slot Attention | 2023-03-27

  • TEXTOIR

    TEXTOIR is the first open-source toolkit for open intent recognition. (ACL 2021)

  • fuzzy-c-means

    A simple python implementation of Fuzzy C-means algorithm.

  • stringlifier

    Stringlifier is an open-source ML library for detecting random strings in raw text. It can be used for sanitising logs, detecting accidentally exposed credentials, and as a pre-processing step in unsupervised ML-based analysis of application text data.

  • DBCV

    Python implementation of Density-Based Clustering Validation

  • n2d

    A deep clustering algorithm. Code to reproduce results for the paper N2D: (Not Too) Deep Clustering via Clustering the Local Manifold of an Autoencoded Embedding.

  • hazelcast-python-client

    Hazelcast Python Client

  • text-summarizer

    Python Framework for Extractive Text Summarization

  • Revisiting-Contrastive-SSL

    Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [NeurIPS 2021]

  • relevanceai

    Home of the AI workforce - Multi-agent system, AI agents & tools

  • impfuzzy

    Fuzzy Hash calculated from import API of PE files

  • hironex

    HiRoNEx: Historical road network extractor: A python tool for automatic, fully unsupervised extraction of historical road networks from historical maps.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-12-05.

What are some of the best open-source Clustering projects in Python? This list will help you:

Rank  Project                              Stars
1     orange                               4,509
2     dedupe                               3,929
3     awesome-community-detection          2,250
4     uis-rnn                              1,525
5     minisom                              1,349
6     Unsupervised-Classification          1,292
7     mteb                                 1,040
8     similarity                           993
9     PyPOTS                               578
10    Unsupervised-Semantic-Segmentation   378
11    matrixprofile                        350
12    slot-attention                       341
13    TEXTOIR                              167
14    fuzzy-c-means                        157
15    stringlifier                         155
16    DBCV                                 131
17    n2d                                  121
18    hazelcast-python-client              114
19    text-summarizer                      113
20    Revisiting-Contrastive-SSL           85
21    relevanceai                          84
22    impfuzzy                             82
23    hironex                              68