Data Mining

Open-source projects categorized as Data Mining

Top 23 Data Mining Open-Source Projects

  • ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

    Project mention: Neural Network from Scratch | | 2022-01-04

    Interesting find. Just FYI, this repo has been the OG for several years, when it comes to building NN from scratch:

  • awesome-datascience

    :memo: An awesome Data Science repository to learn and apply for real world problems.

    Project mention: High income skills? | | 2021-12-22

    There are several on github, such as:

  • EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

    Project mention: [P] Training to read PDF documents. Any ideas? | | 2022-05-20

    If all you need to do is OCR, check out , it's a similar architecture to the cloud services, without all the $. You'll end up with extracted text and bounding boxes for it.

  • LightGBM

    A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

    Project mention: What's New with AWS: Amazon SageMaker built-in algorithms now provides four new Tabular Data Modeling Algorithms | | 2022-06-28

    LightGBM is a popular and high-performance open-source implementation of the Gradient Boosting Decision Tree (GBDT). To learn how to use this algorithm, please see example notebooks for Classification and Regression.

  • gensim

    Topic Modelling for Humans


    Here we have to install the gensim library in a Jupyter notebook before we can use it in our project.

  • awesome-production-machine-learning

    A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning

    Project mention: Sqldiff: SQLite Database Difference Utility | | 2022-05-04

  • python-machine-learning-book

    The "Python Machine Learning (1st edition)" book code repository and info resource

    Project mention: What is the purpose of meshgrid in Python / NumPy? | | 2022-01-06

    I am studying "Python Machine Learning" from Sebastian Raschka, and he is using it for plotting the decision borders. See input 11 here.

  • catboost

    A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

    Project mention: What's New with AWS: Amazon SageMaker built-in algorithms now provides four new Tabular Data Modeling Algorithms | | 2022-06-28

    CatBoost is another popular and high-performance open-source implementation of the Gradient Boosting Decision Tree (GBDT). To learn how to use this algorithm, please see example notebooks for Classification and Regression.

  • anomaly-detection-resources

    Anomaly detection related books, papers, videos, and toolboxes

    Project mention: anomaly-detection-resources: NEW Extended Research - star count:6040.0 | | 2022-06-30
  • pyod

    A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

    Project mention: Predictive Maintenance and Anomaly Detection Resources | | 2022-06-13
  • sktime

    A unified framework for machine learning with time series

    Project mention: Forecasting three months ahead. | | 2022-04-07
  • awesome-ml-for-cybersecurity

    :octocat: Machine Learning for Cyber Security

    Project mention: Machine learning in Cyber Security | | 2022-06-06

    There is a lot you can work on. You can start here : If I had the time, I'd play with this tool :

  • Ferret

    Declarative web scraping

  • orange

    🍊 :bar_chart: :bulb: Orange: Interactive data analysis

    Project mention: Clustering and Heat map software (mac) | | 2022-07-01

  • datascience

    Curated list of Python resources for data science.

    Project mention: Datascience Libraries for Python | | 2021-11-13
  • textract

    extract text from any document. no muss. no fuss.

    Project mention: How to give a file path to a file parser when you only have an HTTPRequest? | | 2022-01-13
  • awesome-TS-anomaly-detection

    List of tools & datasets for anomaly detection on time-series data.

    Project mention: Anomaly Detection in Time-Series | | 2022-03-28

    For a list of anomaly detection packages, look at this repo. For R the oddstream package could be worth a try. But yeah, maybe try out simpler solutions first. Sometimes deterministic rules can be enough. I worked on a problem with temperature sensors before and a combination of time series smoothing, deterministic rules, a small time frame and isolation forest worked best.
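The combination described in this mention (smoothing plus deterministic rules) can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not the commenter's actual pipeline: the function names, the window size, the deviation threshold, the valid range, and the sample readings are all assumptions made for the example, and the isolation forest step is left out.

```python
from statistics import median


def rolling_median(values, window=5):
    """Smooth a series with a centered rolling median (window shrinks at the edges)."""
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def flag_anomalies(values, window=5, max_deviation=3.0, valid_range=(-40.0, 85.0)):
    """Return one boolean per reading: True if a deterministic rule fires.

    Rule 1: the raw reading is outside the (assumed) physical sensor range.
    Rule 2: the raw reading deviates from the smoothed series by more than
            max_deviation (an assumed threshold).
    """
    smoothed = rolling_median(values, window)
    low, high = valid_range
    flags = []
    for raw, smooth in zip(values, smoothed):
        out_of_range = not (low <= raw <= high)
        spike = abs(raw - smooth) > max_deviation
        flags.append(out_of_range or spike)
    return flags


# Hypothetical temperature readings with two injected faults.
readings = [21.0, 21.2, 21.1, 35.0, 21.3, 21.2, 21.4, 120.0, 21.5, 21.3]
anomalous = [i for i, flagged in enumerate(flag_anomalies(readings)) if flagged]
print(anomalous)  # indices of flagged readings
```

A median is used for the smoothing here because a single spike barely moves it, so the spike still stands out against the smoothed value. A model-based detector such as an isolation forest could then be applied to whatever the rules do not catch.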

  • bolt

    10x faster matrix and vector operations (by dblalock)

    Project mention: Bolt: Faster matrix and vector operations that run on compressed data | | 2022-06-18
  • pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • WebPlotDigitizer

    HTML5 based online tool to extract numerical data from plot images.

    Project mention: Digitizing Well Logs | | 2022-06-16
  • kaggle-solutions

    🏅 Collection of Kaggle Solutions and Ideas 🏅

    Project mention: Collection of Kaggle Past Solutions (to learn ideas and techniques) | | 2022-04-18
  • tsv-utils

    eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

    Project mention: Splitting CSV files at 3GB/s | | 2022-06-20
  • pycm

    Multi-class confusion matrix library in Python

    Project mention: PyCM 3.5 released: Multi-class confusion matrix library in Python | | 2022-04-27
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-07-01.

What are some of the best open-source Data Mining projects? This list will help you:

Rank Project Stars
1 ML-From-Scratch 21,196
2 awesome-datascience 18,949
3 EasyOCR 15,117
4 LightGBM 13,926
5 gensim 13,304
6 awesome-production-machine-learning 11,817
7 python-machine-learning-book 11,598
8 catboost 6,613
9 anomaly-detection-resources 6,066
10 pyod 5,783
11 sktime 5,454
12 awesome-ml-for-cybersecurity 5,265
13 Ferret 5,024
14 orange 3,472
15 datascience 3,360
16 textract 3,261
17 awesome-TS-anomaly-detection 2,246
18 bolt 2,197
19 pdftabextract 1,994
20 WebPlotDigitizer 1,808
21 kaggle-solutions 1,599
22 tsv-utils 1,323
23 pycm 1,252