Python Data Mining

Open-source Python projects categorized as Data Mining

Top 23 Python Data Mining Projects

  • GitHub repo ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

    Project mention: Neural Network from Scratch | 2022-01-04

    Interesting find. Just FYI, this repo has been the OG for several years, when it comes to building NN from scratch:

  • GitHub repo EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.

    Project mention: [Question] Best approach for Optical Character recognition on large (20MB+) photos? | 2021-11-10

    Try easyocr or Tesseract. Both are pretty easy to use and don't need much background in OpenCV.

  • GitHub repo gensim

    Topic Modelling for Humans

    Project mention: Topic modelling with Gensim and SpaCy on startup news | 2022-01-17

    For the topic modelling itself, I am going to use Gensim library by Radim Rehurek, which is very developer friendly and easy to use.

  • GitHub repo anomaly-detection-resources

    Anomaly detection related books, papers, videos, and toolboxes

    Project mention: anomaly-detection-resources: NEW Extended Research - star count:5415.0 | 2022-01-19

  • GitHub repo pyod

    (JMLR' 19) A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)

    Project mention: [D] Unsupervised Outlier Detection - Advise Requested | 2021-12-03

    The source code and documentation of PyOD are the best survey of OOD detection. Besides, normalizing flows and VQ-VAE are also feasible.
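
For readers new to unsupervised outlier detection, the basic idea can be sketched with the standard library alone. This is not PyOD's API; it is a minimal modified z-score detector based on the median absolute deviation (MAD), and the function name `mad_outliers`, the 3.5 threshold, and the sample readings are assumptions made for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. The 3.5 cut-off is a common rule of thumb,
    used here as an assumption rather than a tuned value.
    """
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        return []
    return [
        i for i, dev in enumerate(deviations)
        if 0.6745 * dev / mad > threshold
    ]


# Hypothetical sensor readings with one obvious anomaly at index 5.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
outlier_indices = mad_outliers(readings)
print(outlier_indices)
```

A median-based score is used here because a single extreme value barely moves the median, whereas it can inflate the mean and standard deviation enough to hide itself.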

  • GitHub repo sktime

    A unified framework for machine learning with time series

    Project mention: Good python time series libraries? | 2021-12-13


  • GitHub repo orange

    🍊 📊 💡 Orange: Interactive data analysis

    Project mention: ETL Library for Python | 2021-09-27

    "On the simpler side". Do you mean with a graphical interface? Then, orange would be a nice solution.

  • GitHub repo pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • GitHub repo vectorbt

    Find your trading edge, using the fastest engine for backtesting, algorithmic trading, and research.

    Project mention: Repost with explanation - OOS Testing cluster | 2022-01-01

    I second the idea of looking into software optimization, but there is no need to jump right to C. I would look at something like vectorbt. You get the speed of C running under the hood while staying in Python for your backtesting code.

  • GitHub repo pycm

    Multi-class confusion matrix library in Python

    Project mention: [P] PyCM 3.3 released: Comparison of Classifiers Based on Confusion Matrix | 2021-10-27
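
As a rough illustration of what a multi-class confusion matrix holds, here is a minimal standard-library sketch. It does not use PyCM's API; the function names `confusion_matrix` and `accuracy` and the sample labels are made up for this example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Build a nested dict: matrix[actual_class][predicted_class] -> count."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    classes = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in classes} for a in classes}


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[c][c] for c in matrix)
    return correct / total if total else 0.0


# Hypothetical labels for a three-class problem.
actual = ["cat", "dog", "bird", "cat", "dog", "bird"]
predicted = ["cat", "dog", "cat", "cat", "bird", "bird"]

matrix = confusion_matrix(actual, predicted)
for label, row in matrix.items():
    print(label, row)
print("accuracy:", accuracy(matrix))
```

Rows are the true classes and columns the predicted ones, so off-diagonal cells show which classes a classifier confuses with each other.
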
  • GitHub repo invoice2data

    Extract structured data from PDF invoices

    Project mention: – Extract text, data, photos and more from all types of docs | 2021-02-10

    It's not really working. Tried 2 English PDF invoices. Normal format. One came back empty, the other only had the amount right.

    I'm assuming they only trained on some specific documents (passport of country X, etc) and all others don't work.

    If someone processes the same document all the time, then my invoice2data project may work better and is open source. It's based on regex, rather than machine learning:
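
To make the regex-based approach concrete, here is a minimal sketch of template-driven field extraction using only Python's `re` module. It is not invoice2data's actual API or template format; the `TEMPLATE` patterns, the `extract_fields` helper, and the sample text are all hypothetical and assume the PDF has already been converted to plain text.

```python
import re

# Hypothetical template for one invoice layout: field name -> regex with
# exactly one capture group holding the value.
TEMPLATE = {
    "invoice_number": r"Invoice\s+(?:No\.?|Number)[:\s]+(\w[\w-]*)",
    "date": r"Date[:\s]+(\d{4}-\d{2}-\d{2})",
    "amount": r"Total[:\s]+\$?([\d,]+\.\d{2})",
}


def extract_fields(text, template):
    """Apply each pattern to the text; unmatched fields come back as None."""
    result = {}
    for field, pattern in template.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result


# Made-up text, standing in for the output of a PDF-to-text step.
SAMPLE_TEXT = """ACME Corp
Invoice No: INV-2021-0042
Date: 2021-02-10
Total: $1,234.50
"""

fields = extract_fields(SAMPLE_TEXT, TEMPLATE)
print(fields)
```

A template like this only works for documents that share the layout it was written for, which matches the "same document all the time" caveat above.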

  • GitHub repo awesome-fraud-detection-papers

    A curated list of data mining papers about fraud detection.

    Project mention: awesome-fraud-detection-papers: NEW Extended Research - star count:1039.0 | 2022-01-15

  • GitHub repo nfstream

    NFStream: a Flexible Network Data Analysis Framework.

    Project mention: Open Source Deep Packet Inspection Using Python | 2021-07-02

    GitHub project:

    Community feedback and contributions are welcome!

  • GitHub repo URS

    Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

    Project mention: Question for using the Universal Reddit Scraper (URS) | 2021-12-19

    I'm new to Python, and to coding generally. I'm using a great tool called the Universal Reddit Scraper to pull some Reddit data. It allows you to scrape subreddits, among other things. It creates a CSV file with a list of submissions in a given subreddit, with each one's ID as a column.

  • GitHub repo instascrape

    Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

    Project mention: Question about Instagram scraping problem for Thesis (Too big size of data to scrape) | 2021-11-09

    Link to the open source package I used:

  • GitHub repo ail-framework

    AIL framework - Analysis Information Leak framework

  • GitHub repo

    Python class to scrape data from and return listings in a pandas DataFrame object

    Project mention: Need help on finding the right xpath for web scrapping | 2021-04-01

  • GitHub repo striplog

    Lithology and stratigraphic logs for wells or outcrop.

    Project mention: Software for visualization core data | 2021-10-14

  • GitHub repo tree-hugger

    A light-weight, extendable, high level, universal code parser built on top of tree-sitter

    Project mention: Tree Sitter and the Complications of Parsing Languages | 2021-11-24

    tree-sitter is a great framework. I have used it quite a bit in the past. I even created a small library on top of it, called tree-hugger. I really enjoyed their playground as well.

  • GitHub repo imgur-scraper

    Retrieve years of Imgur's data without any authentication.

    Project mention: Download Almost a Decade Of Imgur Data Without Authentication | 2021-09-30

    Here's the repo; please don't hesitate to report a bug or help fix an issue. The new release is slower but gets more data than the older releases. Thoughts and feedback are welcome!

  • GitHub repo lambdo

    Feature engineering and machine learning: together at last!

    Project mention: Why isn't differential dataflow more popular? | 2021-01-22

    It will return the sum of all values in column A. For large tables it will take some time to compute the result. Now assume we append a new record and want to get the new result. The traditional approach is to execute this query again. A better approach is to process only the new record, adding its value in A to the result of the previous query. This is important in (stateful) stream processing.

    Something similar is implemented in these libraries, which, however, rely on a different data processing concept (an alternative to map-reduce): - Functions matter! No join-groupby, No map-reduce. - Feature engineering and machine learning: together at last!
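
The incremental idea described in this mention can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how lambdo or differential dataflow implement it; the `IncrementalSum` class and the sample rows are hypothetical.

```python
class IncrementalSum:
    """Maintains the sum of column A without rescanning rows already seen."""

    def __init__(self, rows=()):
        self.total = 0
        for row in rows:
            self.append(row)

    def append(self, row):
        # Only the new record is processed; earlier rows are never revisited.
        self.total += row["A"]
        return self.total


# Hypothetical table with a single numeric column A.
table = [{"A": 3}, {"A": 5}, {"A": 10}]

# Traditional approach: run the full aggregation over every row.
full_scan = sum(row["A"] for row in table)

# Incremental approach: keep state, then fold in each newly appended record.
aggregate = IncrementalSum(table)
new_row = {"A": 4}
table.append(new_row)
incremental = aggregate.append(new_row)

print(full_scan, incremental)
```

The full scan costs time proportional to the table size on every query, while the incremental update costs constant time per appended record, at the price of keeping the running total as state.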

  • GitHub repo telegram-groups-crawler

    A Telegram crawler made in Python to search groups and channels automatically and collect any type of data from them.

    Project mention: A Telegram groups crawler | 2022-01-08

    Hi everyone, this is my side project in Python:

  • GitHub repo A3

    Inspired by recent advances in coverage-guided analysis of neural networks, we propose a novel anomaly detection method. We show that the hidden activation values contain information useful to distinguish between normal and anomalous samples. Our approach combines three neural networks in a purely data-driven end-to-end model. Based on the activation values in the target network, the alarm network decides if the given sample is normal. Thanks to the anomaly network, our method even works in stri

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-19.

What are some of the best open-source Data Mining projects in Python? This list will help you:

Rank  Project                         Stars
1     ML-From-Scratch                 20,747
2     EasyOCR                         13,570
3     gensim                          12,834
4     anomaly-detection-resources     5,425
5     pyod                            5,181
6     sktime                          4,833
7     orange                          3,211
8     pdftabextract                   1,968
9     vectorbt                        1,584
10    pycm                            1,200
11    invoice2data                    1,124
12    awesome-fraud-detection-papers  1,045
13    nfstream                        753
14    URS                             418
15    instascrape                     395
16    ail-framework                   274
17                                    157
18    striplog                        151
19    tree-hugger                     88
20    imgur-scraper                   26
21    lambdo                          11
22    telegram-groups-crawler         6
23    A3                              6