Python Clustering

Open-source Python projects categorized as Clustering

Top 23 Python Clustering Projects

  • orange

    🍊 📊 💡 Orange: Interactive data analysis

    Project mention: Taxonomy Management? | /r/technicalwriting | 2023-12-05

    First is identifying the "similar" things in a corpus. The best way I know to do that, for non-programmer audiences, is the Orange Data Mining tool, which gives you a node-based text mining interface to perform statistical analysis on text. Hierarchical Clustering shows, very rapidly, how similar your "modules" are and which ones are most similar. There are many other techniques (semantic viewer, similarity hash, etc.) as well; the right one will depend on how your content is laid out.
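    The hierarchical clustering idea mentioned above can be sketched in plain Python. This is not Orange's API or its implementation; it is a minimal illustration that assumes bag-of-words vectors, cosine distance, and average linkage, and the sample "modules" are made up for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts for one document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def agglomerate(docs, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of document indices."""
    vectors = [bag_of_words(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Made-up sample documents standing in for content "modules".
modules = [
    "Install the printer driver on Windows",
    "Install the printer driver on macOS",
    "Reset your account password",
    "Change your account password and email",
]
clusters = agglomerate(modules, n_clusters=2)
print(clusters)  # [[0, 1], [2, 3]]
```

    The two installation modules merge first because they share most of their words, and the two password modules form the second group. A real corpus would likely need better text features (for example TF-IDF weighting) than raw word counts.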

  • dedupe

    🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.

    Project mention: Using deep learning for Fuzzy Matching | /r/datascience | 2023-07-06

  • awesome-community-detection

    A curated list of community detection research papers with implementations.

  • uis-rnn

    This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.

  • minisom

    🔴 MiniSom is a minimalistic implementation of Self-Organizing Maps

  • Unsupervised-Classification

    SCAN: Learning to Classify Images without Labels, incl. SimCLR. [ECCV 2020]

  • mteb

    MTEB: Massive Text Embedding Benchmark

    Project mention: AI for AWS Documentation | | 2023-07-06

    RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:

    - Chunking can interfere with context boundaries

    - Content vectors can differ vastly from question vectors; for this you have to use hypothetical embeddings (they generate artificial questions and store them)

    - Instead of saving just one embedding per text chunk, you should store several (text chunk, hypothetical embedding questions, metadata)

    - RAG will fail miserably with requests like "summarize the whole document"

    - To my knowledge, OpenAI embeddings aren't performing well; use an embedding that is optimized for question answering or information retrieval and supports multiple languages. Also look into instructor embeddings:


  • similarity

    TensorFlow Similarity is a python package focused on making similarity learning quick and easy.

  • PyPOTS

    A Python toolbox/library for reality-centric machine learning/deep learning on partially-observed time series with PyTorch, including SOTA models supporting tasks of imputation, classification, clustering, and forecasting on incomplete (irregularly-sampled) multivariate time series with NaN missing values/data.

    Project mention: Missing values in time series collected from the real world are common to see and very pesky. A new state-of-the-art and fast neural network called SAITS is proposed to impute missing data in partially-observed multivariate time series. The code is open source on GitHub. | /r/datascience | 2023-06-28

    Oh, wow, thanks for sharing it here! PyPOTS still has a long way to go, and I'm making it better. If you have any suggestions for PyPOTS, please let me know. Your feedback is always welcome and means a lot to the community of PyPOTS! If you like PyPOTS, please star 🌟 PyPOTS repo on GitHub and share it with people you know who may need it to help others notice this helpful work. Thank you very much!

  • Unsupervised-Semantic-Segmentation

    Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals. [ICCV 2021]

  • matrixprofile

    A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.

  • slot-attention

    Implementation of Slot Attention from GoogleAI

    Project mention: Object-Centric Learning with Slot Attention | | 2023-03-27

  • TEXTOIR

    TEXTOIR is the first open-source toolkit for open intent recognition. (ACL 2021)

  • fuzzy-c-means

    A simple Python implementation of the Fuzzy C-means algorithm.

  • stringlifier

    Stringlifier is an open-source ML library for detecting random strings in raw text. It can be used for sanitising logs, detecting accidentally exposed credentials, and as a pre-processing step in unsupervised ML-based analysis of application text data.

  • DBCV

    Python implementation of Density-Based Clustering Validation

  • n2d

    A deep clustering algorithm. Code to reproduce results for the paper N2D: (Not Too) Deep Clustering via Clustering the Local Manifold of an Autoencoded Embedding.

  • hazelcast-python-client

    Hazelcast Python Client

  • text-summarizer

    Python Framework for Extractive Text Summarization

  • Revisiting-Contrastive-SSL

    Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [NeurIPS 2021]

  • relevanceai

    Home of the AI workforce - Multi-agent system, AI agents & tools

  • impfuzzy

    Fuzzy Hash calculated from import API of PE files

  • hironex

    HiRoNEx: Historical road network extractor: A python tool for automatic, fully unsupervised extraction of historical road networks from historical maps.

NOTE: The open-source projects on this list are ordered by number of GitHub stars.

