Python Data Mining

Open-source Python projects categorized as Data Mining

Top 23 Python Data Mining Projects

  • ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

  • EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.

    Project mention: Leveraging GPT-4 for PDF Data Extraction: A Comprehensive Guide | dev.to | 2023-12-27

    PyTesseract module, EasyOCR module, PaddlePaddle OCR

  • gensim

    Topic Modelling for Humans

    Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08

  • pyod

    A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

    Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13

    This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod

  • anomaly-detection-resources

    Anomaly detection related books, papers, videos, and toolboxes

    Project mention: anomaly-detection-resources: NEW Extended Research - star count:7507.0 | /r/algoprojects | 2023-10-24

  • catboost

    A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

    Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05

  • sktime

    A unified framework for machine learning with time series

  • orange

    🍊 📊 💡 Orange: Interactive data analysis

    Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07

  • pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • invoice2data

    Extract structured data from PDF invoices

  • awesome-fraud-detection-papers

    A curated list of data mining papers about fraud detection.

    Project mention: awesome-fraud-detection-papers: NEW Extended Research - star count:1346.0 | /r/algoprojects | 2023-05-13

  • pycm

    Multi-class confusion matrix library in Python

    Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07

  • CleverCSV

    CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.

    Project mention: Parquet: more than just "Turbo CSV" | /r/programming | 2023-04-03

    There’s things like this, but I consider the existence of messy, non standard CSV files (backed by a decade of experience dealing with the problem) a strong reason to not use the format ever.

  • deep_gcns_torch

    PyTorch repo for DeepGCNs (ICCV'2019 Oral, TPAMI'2021), DeeperGCN (arXiv'2020) and GNN1000 (ICML'2021): https://www.deepgcns.org

  • nfstream

    NFStream: a Flexible Network Data Analysis Framework.

  • aeon

    A toolkit for conducting machine learning tasks with time series data

    Project mention: FLaNK 15 Jan 2024 | dev.to | 2024-01-15

  • ADBench

    Official implementation of "ADBench: Anomaly Detection Benchmark", NeurIPS 2022.

  • UnityPy

    UnityPy is a Python module that makes it possible to extract/unpack and edit Unity assets.

  • pm4py-core

    Public repository for the PM4Py (Process Mining for Python) project.

  • PyPOTS

    A Python toolbox for reality-centric machine learning, deep learning, and data mining on partially observed time series, built on PyTorch. It includes state-of-the-art neural network models for imputation, classification, clustering, and forecasting on incomplete (irregularly sampled) multivariate time series with NaN missing values.

    Project mention: Missing values in time series collected from the real world are common to see and very pesky. A new state-of-the-art and fast neural network called SAITS is proposed to impute missing data in partially-observed multivariate time series. The code is open source on GitHub. | /r/datascience | 2023-06-28

    Oh, wow, thanks for sharing it here! PyPOTS still has a long way to go, and I'm making it better. If you have any suggestions for PyPOTS, please let me know. Your feedback is always welcome and means a lot to the community of PyPOTS! If you like PyPOTS, please star 🌟 PyPOTS repo on GitHub and share it with people you know who may need it to help others notice this helpful work. Thank you very much!

  • ail-framework

    AIL framework - Analysis Information Leak framework

    Project mention: Ask HN: Show me your half baked project | news.ycombinator.com | 2023-10-12

    First time coming across this, looks very cool! Definitely some ideas there that I'd like to implement for osintbuddy. Another project I'm going to be taking some ideas from is: https://github.com/ail-project/ail-framework - a modular framework to analyse potential information leaks

  • matrixprofile

    A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.

  • grimoirelab-perceval

    Send Sir Perceval on a quest to retrieve and gather data from software repositories.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-07.
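
The CleverCSV entry in the list describes improved dialect detection over the builtin CSV module. As a rough illustration of what dialect detection means, here is a minimal sketch that uses only the standard library's csv.Sniffer, not CleverCSV itself. The function name, the candidate delimiter set, and the sample data are made up for illustration.

```python
import csv
import io


def read_with_detected_dialect(text, candidates=";,\t|"):
    """Guess the CSV dialect of `text`, then parse it with that dialect.

    Hypothetical helper for illustration; it relies on the standard
    library's csv.Sniffer and does not use CleverCSV.
    """
    # Sniffer inspects the sample and guesses delimiter and quoting rules.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect, rows


# Made-up sample: semicolon-separated instead of comma-separated.
sample = (
    "name;score;city\n"
    "alice;9.5;Oslo\n"
    "bob;7.25;Lima\n"
    "carol;8.0;Kyiv\n"
)

dialect, rows = read_with_detected_dialect(sample)
print(repr(dialect.delimiter))
for row in rows:
    print(row)
```

csv.Sniffer relies on simple heuristics and can raise csv.Error or guess wrongly on irregular files, which is the kind of messy input the CleverCSV entry says the package is meant for.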

Index

What are some of the best open-source Data Mining projects in Python? This list will help you:

#   Project                          Stars
1   ML-From-Scratch                 23,004
2   EasyOCR                         21,448
3   gensim                          15,074
4   pyod                             7,854
5   anomaly-detection-resources      7,787
6   catboost                         7,668
7   sktime                           7,269
8   orange                           4,551
9   pdftabextract                    2,129
10  invoice2data                     1,662
11  awesome-fraud-detection-papers   1,521
12  pycm                             1,423
13  CleverCSV                        1,197
14  deep_gcns_torch                  1,104
15  nfstream                         1,033
16  aeon                               760
17  ADBench                            752
18  UnityPy                            700
19  pm4py-core                         625
20  PyPOTS                             596
21  ail-framework                      463
22  matrixprofile                      352
23  grimoirelab-perceval               285