Top 23 Python Data Mining Projects
-
ML-From-Scratch
Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
-
EasyOCR
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.
-
catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
-
pdftabextract
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
-
CleverCSV
CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
-
deep_gcns_torch
PyTorch repo for DeepGCNs (ICCV'2019 Oral, TPAMI'2021), DeeperGCN (arXiv'2020) and GNN1000 (ICML'2021): https://www.deepgcns.org
-
PyPOTS
A Python toolbox/library for reality-centric machine/deep learning and data mining on partially-observed time series with PyTorch, including SOTA neural network models for science tasks of imputation, classification, clustering, and forecasting on incomplete (irregularly-sampled) multivariate time series with NaN missing values/data.
-
matrixprofile
A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.
-
grimoirelab-perceval
Send Sir Perceval on a quest to retrieve and gather data from software repositories.
Project mention: Leveraging GPT-4 for PDF Data Extraction: A Comprehensive Guide | dev.to | 2023-12-27
PyTesseract Module, EasyOCR Module, PaddlePaddle OCR
Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13
This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod
Project mention: anomaly-detection-resources: NEW Extended Research - star count:7507.0 | /r/algoprojects | 2023-10-24
Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05
I know I've tooted its horn before, but Orange3 is a pretty neat Python-based GUI platform that makes this and a metric buttload of other statistical/ML techniques available to non-programmer types.
Just watch out for null characters (`\x00`) in the corpus. That always seems to kill it stone dead.
https://orangedatamining.com/
https://orange3.readthedocs.io/projects/orange-visual-progra...
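For the null character warning in the comment above, a minimal sketch of one way to clean a corpus before loading it into a tool such as Orange3. The helper name `strip_null_chars` and the sample `corpus` list are hypothetical and for illustration only; this is not part of Orange3's API, and it assumes the documents are already decoded Python strings.

```python
def strip_null_chars(text: str) -> str:
    """Return the text with every NUL (\\x00) character removed."""
    return text.replace("\x00", "")


# Hypothetical sample corpus: the second document contains a NUL character.
corpus = ["clean document", "bad\x00document"]

# Clean every document before handing the corpus to downstream tools.
cleaned = [strip_null_chars(doc) for doc in corpus]
```

Removing the character outright is the simplest choice; replacing it with a space may be preferable if the NUL was acting as a separator in the original data.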
Project mention: awesome-fraud-detection-papers: NEW Extended Research - star count:1346.0 | /r/algoprojects | 2023-05-13
Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07
Project mention: Missing values in time series collected from the real world are common to see and very pesky. A new state-of-the-art and fast neural network called SAITS is proposed to impute missing data in partially-observed multivariate time series. The code is open source on GitHub. | /r/datascience | 2023-06-28
Oh, wow, thanks for sharing it here! PyPOTS still has a long way to go, and I'm making it better. If you have any suggestions for PyPOTS, please let me know. Your feedback is always welcome and means a lot to the community of PyPOTS! If you like PyPOTS, please star 🌟 PyPOTS repo on GitHub and share it with people you know who may need it to help others notice this helpful work. Thank you very much!
First time coming across this, looks very cool! Definitely some ideas there that I'd like to implement for osintbuddy. Another project I'm going to be taking some ideas from is: https://github.com/ail-project/ail-framework - a modular framework to analyse potential information leaks
Python Data Mining related posts
- Hierarchical Clustering
- Orange Data Mining
- The Graph of Wikipedia [video]
- Taxonomy Management?
- Orange: Open-source machine learning and data visualization
- Aeon: A unified framework for machine learning with time series
- What exactly is AutoGPT?
Index
What are some of the best open-source Data Mining projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | ML-From-Scratch | 23,164 |
2 | EasyOCR | 21,882 |
3 | gensim | 15,236 |
4 | pyod | 7,941 |
5 | anomaly-detection-resources | 7,858 |
6 | catboost | 7,744 |
7 | sktime | 7,404 |
8 | orange | 4,604 |
9 | pdftabextract | 2,152 |
10 | invoice2data | 1,685 |
11 | awesome-fraud-detection-papers | 1,545 |
12 | pycm | 1,429 |
13 | CleverCSV | 1,213 |
14 | deep_gcns_torch | 1,104 |
15 | nfstream | 1,042 |
16 | aeon | 794 |
17 | ADBench | 770 |
18 | UnityPy | 720 |
19 | PyPOTS | 660 |
20 | pm4py-core | 639 |
21 | ail-framework | 495 |
22 | matrixprofile | 354 |
23 | grimoirelab-perceval | 284 |