Data Mining

Top 23 Data Mining Open-Source Projects

  • ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

  • awesome-datascience

    📝 An awesome Data Science repository to learn and apply for real world problems.

  • Project mention: About Data analyst, data scientist and data engineer, resources and experiences | dev.to | 2024-03-26

    Awesome Data Science by Academic

  • EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

  • Project mention: Leveraging GPT-4 for PDF Data Extraction: A Comprehensive Guide | dev.to | 2023-12-27

    PyTesseract Module [ Github ] EasyOCR Module [ Github ] PaddlePaddle OCR [ Github ]

  • LightGBM

    A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

  • Project mention: SIRUS.jl: Interpretable Machine Learning via Rule Extraction | /r/Julia | 2023-06-29

    SIRUS.jl is a pure Julia implementation of the SIRUS algorithm by Bénard et al. (2021). The algorithm is a rule-based machine learning model, meaning that it is fully interpretable. It achieves this by first fitting a random forest and then converting the forest to rules. Furthermore, the algorithm is stable and achieves a predictive performance that is comparable to LightGBM, a state-of-the-art gradient boosting model created by Microsoft. Interpretability, stability, and predictive performance are described in more detail below.

  • awesome-production-machine-learning

    A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning

  • Project mention: Exploring Open-Source Alternatives to Landing AI for Robust MLOps | dev.to | 2023-12-13

    One trove of treasures is the awesome-production-machine-learning repository on GitHub. This curated list provides a multitude of frameworks, libraries, and software designed to facilitate various stages of the ML lifecycle.

  • gensim

    Topic Modelling for Humans

  • Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08
  • python-machine-learning-book

    The "Python Machine Learning (1st edition)" book code repository and info resource

  • pyod

    A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

  • Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13

    This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod

  • anomaly-detection-resources

    Anomaly detection related books, papers, videos, and toolboxes

  • Project mention: anomaly-detection-resources: NEW Extended Research - star count:7507.0 | /r/algoprojects | 2023-10-24
  • catboost

    A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

  • Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05
  • sktime

    A unified framework for machine learning with time series

  • awesome-ml-for-cybersecurity

    Machine Learning for Cyber Security

  • Ferret

    Declarative web scraping

  • orange

    🍊 📊 💡 Orange: Interactive data analysis

  • Project mention: Hierarchical Clustering | news.ycombinator.com | 2024-04-20

    I know I've tooted its horn before, but Orange3 is a pretty neat Python-based GUI platform that makes this and a metric buttload of other statistical/ML techniques available to non-programmer types.

    Just watch out for the null character `\x00` in the corpus. That always seems to kill it stone dead.

    https://orangedatamining.com/

    https://orange3.readthedocs.io/projects/orange-visual-progra...

  • datascience

    Curated list of Python resources for data science.

  • textract

    extract text from any document. no muss. no fuss.

  • kaggle-solutions

    🏅 Collection of Kaggle Solutions and Ideas 🏅

  • awesome-TS-anomaly-detection

    List of tools & datasets for anomaly detection on time-series data.

  • Project mention: awesome-TS-anomaly-detection: NEW Data - star count:2694.0 | /r/algoprojects | 2023-11-21
  • WebPlotDigitizer

    Computer vision assisted tool to extract numerical data from plot images.

  • Project mention: Digitized Continuous Magnetic Recordings for the 1859 Carrington Event | news.ycombinator.com | 2024-04-23

    Something similar which is more recently-maintained: https://github.com/automeris-io/WebPlotDigitizer

  • bolt

    10x faster matrix and vector operations (by dblalock)

  • graphic-walker

    An open source alternative to Tableau. Embeddable visual analytic

  • Project mention: Show HN: Open-source, browser-local data exploration using DuckDB-WASM and PRQL | news.ycombinator.com | 2024-03-15

    [2] https://github.com/Kanaries/graphic-walker/issues/330

  • pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • invoice2data

    Extract structured data from PDF invoices

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Data Mining projects? This list will help you:

Rank Project Stars
1 ML-From-Scratch 23,164
2 awesome-datascience 23,101
3 EasyOCR 21,882
4 LightGBM 16,043
5 awesome-production-machine-learning 15,947
6 gensim 15,236
7 python-machine-learning-book 12,076
8 pyod 7,941
9 anomaly-detection-resources 7,858
10 catboost 7,744
11 sktime 7,404
12 awesome-ml-for-cybersecurity 6,769
13 Ferret 5,616
14 orange 4,604
15 datascience 4,071
16 textract 3,778
17 kaggle-solutions 3,745
18 awesome-TS-anomaly-detection 2,811
19 WebPlotDigitizer 2,496
20 bolt 2,463
21 graphic-walker 2,223
22 pdftabextract 2,152
23 invoice2data 1,694
