Python Data Mining

Open-source Python projects categorized as Data Mining

Top 23 Python Data Mining Projects

  • ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

    Project mention: Tutorials on creating primitive ML algorithms from scratch? | | 2023-01-24


  • EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

    Project mention: I made a website for a friend who owns a restaurant. He's wondering if there's a way to upload a picture of his menu daily. What is the best way to do this? | | 2023-01-15

  • gensim

    Topic Modelling for Humans

    Project mention: Understanding How Dynamic node2vec Works on Streaming Data | | 2022-12-23

    This is our optimization problem. Now, we hope that you have an idea of what our goal is. Luckily for us, this is already implemented in a Python module called gensim. Yes, these guys are brilliant in natural language processing and we will make use of it. 🤝

  • anomaly-detection-resources

    Anomaly detection related books, papers, videos, and toolboxes

    Project mention: anomaly-detection-resources: NEW Extended Research - star count:6556.0 | | 2022-11-15
  • pyod

    A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

    Project mention: Pyod – A Comprehensive and Scalable Python Library for Outlier Detection | | 2022-08-10
  • sktime

    A unified framework for machine learning with time series

    Project mention: Does anyone know a trusted Python package for applying Croston's Time series method? | | 2022-12-04

    I initially used sktime's Croston class, but when I try to get the fitted values using the steps in the discussion on GitHub, the values are all the same: a straight line throughout the in-sample to out-of-sample predictions.
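
    For context on the flat line mentioned above, here is a minimal pure-Python sketch of Croston's method. It is not sktime's implementation; the function name `croston_forecast`, the default `alpha`, and the initialization from the first non-zero demand are assumptions chosen for illustration (initialization conventions vary). The method smooths non-zero demand sizes and the intervals between them separately, and the forecast is their ratio, a single per-period rate that is repeated for every future step.

```python
def croston_forecast(demand, alpha=0.1):
    """Return the per-period demand rate estimated by Croston's method.

    Illustrative sketch only: initializes both smoothed series from the
    first non-zero demand, which is one common convention among several.
    """
    size = None      # smoothed non-zero demand size
    interval = None  # smoothed number of periods between non-zero demands
    gap = 1          # periods elapsed since the last non-zero demand

    for y in demand:
        if y > 0:
            if size is None:
                # First non-zero demand: use it to initialize both series.
                size, interval = float(y), float(gap)
            else:
                # Simple exponential smoothing, updated only on demand periods.
                size += alpha * (y - size)
                interval += alpha * (gap - interval)
            gap = 1
        else:
            gap += 1

    if size is None:
        return 0.0  # no demand was ever observed
    return size / interval


# Every out-of-sample step gets the same value, hence a flat forecast line.
rate = croston_forecast([0, 0, 4, 0, 0, 4], alpha=0.5)
horizon_forecast = [rate] * 3
```

    Because the state only changes when a non-zero demand arrives, nothing updates the estimate beyond the end of the data, so a constant out-of-sample forecast is expected behavior for this method rather than a sign of a bug.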

  • orange

    🍊 📊 💡 Orange: Interactive data analysis

    Project mention: Statistical Analysis software based on Python? | | 2023-01-28

    The only thing I can think of is Orange, which has some statistics capability, but that isn't its focus.

  • pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • invoice2data

    Extract structured data from PDF invoices

    Project mention: Utilize OpenAI API to extract information from PDF files | | 2023-01-28

    Using regex: to match patterns in text after converting the PDF to plain text. Examples include invoice2data and traprange-invoice. However, this method requires knowledge of the format of the data fields.
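
    The regex approach described above can be sketched with the standard library alone. This is not invoice2data's API; the sample text, the field labels, and the `extract_fields` helper are hypothetical and stand in for the output of a PDF-to-text step. It also illustrates the limitation noted in the quote: each pattern encodes prior knowledge of how a field is laid out.

```python
import re

# Hypothetical plain text, as it might look after converting a PDF invoice.
SAMPLE_TEXT = """\
ACME Corp
Invoice Number: INV-2023-0042
Date: 2023-01-28
Total: 1,234.50 EUR
"""

# One pattern per field; each assumes a known label and value format.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict of field name -> first captured value, or None if absent."""
    result = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


fields = extract_fields(SAMPLE_TEXT)
```

    A document whose labels or number formats differ from these patterns yields `None` for the affected fields, which is why this method needs a template per invoice layout.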

  • pycm

    Multi-class confusion matrix library in Python

    Project mention: PyCM 3.8 Released: Distance/Similarity Support | | 2023-02-02
  • awesome-fraud-detection-papers

    A curated list of data mining papers about fraud detection.

    Project mention: awesome-fraud-detection-papers: NEW Extended Research - star count:1195.0 | | 2022-10-08
  • deep_gcns_torch

    PyTorch repo for DeepGCNs (ICCV'2019 Oral, TPAMI'2021), DeeperGCN (arXiv'2020) and GNN1000 (ICML'2021)

  • nfstream

    NFStream: a Flexible Network Data Analysis Framework.

    Project mention: Monitor your system network traffic using one line of Python | | 2022-09-28
  • URS

    Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

    Project mention: GitHub - JosephLai241/URS: Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python. | | 2022-05-27
  • instascrape

    Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

  • pm4py-core

    Public repository for the PM4Py (Process Mining for Python) project.

    Project mention: Process Analytics - January 2023 News | | 2023-02-01

    It's a new year 🎊 and what better way to kick off 2023 than with some Process Analytics news! January brought us exciting developments in the world of bpmn-visualization and pm4py integration 🔗. With our team working hard to connect the dots, we’re making bpmn-visualization more accessible and easier to integrate with the Process Mining ecosystem.

  • UnityPy

    UnityPy is a Python module that makes it possible to extract/unpack and edit Unity assets

    Project mention: Show HN: Unblob – extraction suite for 30+ file formats | | 2023-01-18

    Since you're the author and I see the tool is in Python: I'm the original author of UnityPack. Nowadays, the fork UnityPy is more powerful and maintained.

    It's in Python and is able to deserialize Unity archives, treating them as a serialization format rather than a simple archive format. Feel free to email me if you want to integrate something like this or you have questions :)

  • ADBench

    Official implementation of "ADBench: Anomaly Detection Benchmark".

    Project mention: ADBench: Anomaly Detection Benchmark | | 2022-06-30
  • ail-framework

    AIL framework - Analysis Information Leak framework

  • matrixprofile

    A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.

    Project mention: matrixprofile: NEW Data - star count:258.0 | | 2022-05-21
  • lasio

    Python library for reading and writing well data using Log ASCII Standard (LAS) files

    Project mention: Any geolog users here? Managing log data | | 2022-02-10

    I might try to merge the .las files using lasio.

  • grimoirelab-perceval

    Send Sir Perceval on a quest to retrieve and gather data from software repositories.

  • PyPOTS

    A Python toolbox/library for data mining on partially observed time series, supporting forecasting, imputation, classification, and clustering on incomplete (irregularly sampled) multivariate time series with missing values.

    Project mention: PyPOTS: NEW Data - star count:182.0 | | 2023-01-14

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-02-02.

What are some of the best open-source Data Mining projects in Python? This list will help you:

Rank  Project                          Stars
   1  ML-From-Scratch                 21,894
   2  EasyOCR                         16,848
   3  gensim                          13,910
   4  anomaly-detection-resources      6,807
   5  pyod                             6,677
   6  sktime                           6,077
   7  orange                           3,919
   8  pdftabextract                    2,037
   9  invoice2data                     1,362
  10  pycm                             1,347
  11  awesome-fraud-detection-papers   1,275
  12  deep_gcns_torch                  1,002
  13  nfstream                           903
  14  URS                                550
  15  instascrape                        527
  16  pm4py-core                         514
  17  UnityPy                            487
  18  ADBench                            485
  19  ail-framework                      344
  20  matrixprofile                      303
  21  lasio                              301
  22  grimoirelab-perceval               266
  23  PyPOTS                             198