Top 23 Python Data Mining Projects
-
ML-From-Scratch
Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
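As a rough illustration of what a "from scratch" implementation looks like, here is a minimal sketch of one-feature linear regression fitted by ordinary least squares. It is not taken from the ML-From-Scratch repository, and it uses only the Python standard library rather than NumPy so that it runs without dependencies; the function name `fit_simple_linear_regression` and the toy data are assumptions made for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimising squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations of x and y, and sum of squared deviations of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (hypothetical) generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```

On noise-free data such as this, the fitted slope and intercept recover the generating line.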
-
EasyOCR
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
Project mention: Leveraging GPT-4 for PDF Data Extraction: A Comprehensive Guide | dev.to | 2023-12-27
PyTesseract Module [GitHub], EasyOCR Module [GitHub], PaddlePaddle OCR [GitHub]
-
Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13
This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod
-
Project mention: anomaly-detection-resources: NEW Extended Research - star count:7507.0 | /r/algoprojects | 2023-10-24
-
catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05
-
Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07
-
pdftabextract
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
-
Project mention: awesome-fraud-detection-papers: NEW Extended Research - star count:1346.0 | /r/algoprojects | 2023-05-13
-
Project mention: PyCM 4.0 Released: Multilabel Confusion Matrix Support | /r/coolgithubprojects | 2023-06-07
-
CleverCSV
CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
There’s things like this, but I consider the existence of messy, non standard CSV files (backed by a decade of experience dealing with the problem) a strong reason to not use the format ever.
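For context on what "dialect detection" means here, the following sketch uses the builtin `csv.Sniffer`, which is the standard-library baseline that CleverCSV is described as improving on. It does not call CleverCSV itself, and the sample data is hypothetical.

```python
import csv
import io

# Hypothetical sample: a semicolon-delimited export, a common non-standard variant.
sample = "name;age;city\nalice;30;Oslo\nbob;25;Lima\ncarol;41;Kyiv\n"

# Restrict the candidate delimiters so the heuristic has less room to guess wrong.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")

# Parse the same text using the detected dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(dialect.delimiter)
print(rows[1])
```

The builtin sniffer relies on character-frequency heuristics, so it can misfire on files with inconsistent quoting or embedded delimiters, which is the kind of input the comment above is complaining about.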
-
deep_gcns_torch
PyTorch repo for DeepGCNs (ICCV'2019 Oral, TPAMI'2021), DeeperGCN (arXiv'2020) and GNN1000 (ICML'2021): https://www.deepgcns.org
-
PyPOTS
A Python toolbox/library for reality-centric machine/deep learning and data mining on partially-observed time series with PyTorch, including SOTA neural network models for science tasks of imputation, classification, clustering, and forecasting on incomplete (irregularly-sampled) multivariate time series with NaN missing values/data.
Project mention: Missing values in time series collected from the real world are common to see and very pesky. A new state-of-the-art and fast neural network called SAITS is proposed to impute missing data in partially-observed multivariate time series. The code is open source on GitHub. | /r/datascience | 2023-06-28
Oh, wow, thanks for sharing it here! PyPOTS still has a long way to go, and I'm making it better. If you have any suggestions for PyPOTS, please let me know. Your feedback is always welcome and means a lot to the community of PyPOTS! If you like PyPOTS, please star 🌟 PyPOTS repo on GitHub and share it with people you know who may need it to help others notice this helpful work. Thank you very much!
-
First time coming across this, looks very cool! Definitely some ideas there that I'd like to implement for osintbuddy. Another project I'm going to be taking some ideas from is: https://github.com/ail-project/ail-framework - a modular framework to analyse potential information leaks
-
matrixprofile
A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.
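To give a sense of what a matrix profile is, here is a naive brute-force sketch: for every window of length `m`, it records the z-normalized Euclidean distance to its nearest non-overlapping neighbor. This is not the matrixprofile library's API or its optimized algorithms; the function names, the `m // 2` exclusion zone, and the toy series are assumptions made for this example.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def naive_matrix_profile(series, m):
    """Distance from each length-m window to its nearest non-trivial match (O(n^2 * m))."""
    n_windows = len(series) - m + 1
    windows = [znorm(series[i:i + m]) for i in range(n_windows)]
    exclusion = m // 2  # skip near-identical neighbors that overlap heavily
    profile = []
    for i in range(n_windows):
        best = math.inf
        for j in range(n_windows):
            if abs(i - j) <= exclusion:
                continue
            best = min(best, math.dist(windows[i], windows[j]))
        profile.append(best)
    return profile


# Toy series (hypothetical): a repeating pattern with one corrupted point at index 10.
series = [0, 1, 2, 1] * 5
series[10] = 5
profile = naive_matrix_profile(series, m=4)
discord = max(range(len(profile)), key=profile.__getitem__)
print(discord)
```

Low profile values mark repeated patterns (motifs), while the highest value marks the window least like any other (a discord), which here falls on a window covering the corrupted point.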
-
grimoirelab-perceval
Send Sir Perceval on a quest to retrieve and gather data from software repositories.
Python Data Mining related posts
- Taxonomy Management?
- Orange: Open-source machine learning and data visualization
- Aeon: A unified framework for machine learning with time series
- What exactly is AutoGPT?
- Why don't more people use Altair for python Visualizations instead of Plotly?
- Advice on Transitioning to Data Science/ML/AI without Coding Experience
- Has anybody used Orange?
Index
What are some of the best open-source Data Mining projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | ML-From-Scratch | 23,004 |
2 | EasyOCR | 21,448 |
3 | gensim | 15,074 |
4 | pyod | 7,854 |
5 | anomaly-detection-resources | 7,787 |
6 | catboost | 7,668 |
7 | sktime | 7,269 |
8 | orange | 4,551 |
9 | pdftabextract | 2,129 |
10 | invoice2data | 1,662 |
11 | awesome-fraud-detection-papers | 1,521 |
12 | pycm | 1,423 |
13 | CleverCSV | 1,197 |
14 | deep_gcns_torch | 1,104 |
15 | nfstream | 1,033 |
16 | aeon | 760 |
17 | ADBench | 752 |
18 | UnityPy | 700 |
19 | pm4py-core | 625 |
20 | PyPOTS | 596 |
21 | ail-framework | 463 |
22 | matrixprofile | 352 |
23 | grimoirelab-perceval | 285 |