Top 23 Python Data Mining Projects
Machine Learning From Scratch. Bare-bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
Project mention: Neural Network from Scratch | news.ycombinator.com | 2022-01-04
Interesting find. Just FYI, this repo has been the OG for several years, when it comes to building NN from scratch:
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
Project mention: [Question] Best approach for Optical Character recognition on large (20MB+) photos? | reddit.com/r/opencv | 2021-11-10
Try easyocr or Tesseract. Both are pretty easy to use and don't need much background in OpenCV.
Topic Modelling for Humans
Project mention: Topic modelling with Gensim and SpaCy on startup news | dev.to | 2022-01-17
For the topic modelling itself, I am going to use the Gensim library by Radim Rehurek, which is very developer-friendly and easy to use.
Anomaly detection related books, papers, videos, and toolboxes
Project mention: anomaly-detection-resources: NEW Extended Research - star count:5415.0 | reddit.com/r/algoprojects | 2022-01-19
(JMLR'19) A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)
Project mention: [D] Unsupervised Outlier Detection - Advise Requested | reddit.com/r/MachineLearning | 2021-12-03
The source code and documentation of PyOD are the best survey of OOD. Besides, normalizing flows and VQ-VAE are also feasible.
A unified framework for machine learning with time series
Project mention: Good python time series libraries? | reddit.com/r/algotrading | 2021-12-13
🍊 📊 💡 Orange: Interactive data analysis
Project mention: ETL Library for Python | reddit.com/r/Python | 2021-09-27
"On the simpler side". Do you mean with a graphical interface? Then, Orange would be a nice solution. https://orangedatamining.com/
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
Find your trading edge, using the fastest engine for backtesting, algorithmic trading, and research.
Project mention: Repost with explanation - OOS Testing cluster | reddit.com/r/algotrading | 2022-01-01
I second the idea of looking into software optimization, but there is no need to jump right to C. I would look at something like vectorbt. You get the speed of C running under the hood while staying in Python for your backtesting code.
Multi-class confusion matrix library in Python
Project mention: [P] PyCM 3.3 released: Comparison of Classifiers Based on Confusion Matrix | reddit.com/r/MachineLearning | 2021-10-27
Extract structured data from PDF invoices
Project mention: Base64.ai – Extract text, data, photos and more from all types of docs | news.ycombinator.com | 2021-02-10
It's not really working. Tried 2 English PDF invoices. Normal format. One came back empty, the other only had the amount right.
I'm assuming they only trained on some specific documents (passport of country X, etc) and all others don't work.
If someone processes the same document all the time, then my invoice2data project may work better and is open source. It's based on regex, rather than machine learning: https://github.com/invoice-x/invoice2data
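The regex-over-machine-learning idea in the comment above can be illustrated with a minimal sketch that uses only Python's standard `re` module. This is not invoice2data's template format or API: the `TEMPLATE` dictionary, the `extract_fields` helper, the patterns, and the sample invoice text are all hypothetical, chosen to show why fixed patterns can work well when the same document layout recurs.

```python
import re

# Hypothetical template: one regex per field, each with a single capture group.
# Real invoices would need patterns tuned to the issuer's fixed layout.
TEMPLATE = {
    "invoice_number": r"Invoice\s+No\.?\s*:\s*(\S+)",
    "date": r"Date\s*:\s*(\d{4}-\d{2}-\d{2})",
    "amount": r"Total\s*:\s*([\d.,]+)",
}


def extract_fields(text, template):
    """Return {field: first captured match or None} for every field in the template."""
    result = {}
    for field, pattern in template.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result


# Made-up text standing in for the output of a PDF-to-text step.
SAMPLE_TEXT = """ACME Ltd.
Invoice No: INV-0042
Date: 2021-02-10
Total: 1,234.50 EUR
"""

fields = extract_fields(SAMPLE_TEXT, TEMPLATE)
print(fields)
```

A field whose pattern does not match comes back as `None`, which makes it easy to spot documents that do not fit the template.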
A curated list of data mining papers about fraud detection.
Project mention: awesome-fraud-detection-papers: NEW Extended Research - star count:1039.0 | reddit.com/r/algoprojects | 2022-01-15
NFStream: a Flexible Network Data Analysis Framework.
Project mention: Open Source Deep Packet Inspection Using Python | news.ycombinator.com | 2021-07-02
GitHub project: https://github.com/nfstream/nfstream
Community feedback and contributions are welcome!
Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.
Project mention: Question for using the Universal Reddit Scraper (URS) | reddit.com/r/learnpython | 2021-12-19
I'm new to Python, and coding generally. I'm using a great tool called the Universal Reddit Scraper (https://github.com/JosephLai241/URS) to pull some Reddit data. It allows you to scrape subreddits, among other things. It creates a CSV file with a list of submissions in a given subreddit, with each one's ID as a column.
Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically
Project mention: Question about Instagram scraping problem for Thesis (Too big size of data to scrape) | reddit.com/r/learnpython | 2021-11-09
Link to the open source package I used: https://github.com/chris-greening/instascrape
AIL framework - Analysis Information Leak framework
Python class to scrape data from rightmove.co.uk and return listings in a pandas DataFrame object
Project mention: Need help on finding the right xpath for web scrapping | reddit.com/r/learnpython | 2021-04-01
Lithology and stratigraphic logs for wells or outcrop.
Project mention: Software for visualization core data | reddit.com/r/geologycareers | 2021-10-14
A light-weight, extendable, high level, universal code parser built on top of tree-sitter
Project mention: Tree Sitter and the Complications of Parsing Languages | news.ycombinator.com | 2021-11-24
tree-sitter is a great framework. I have used it quite a bit in the past. I even created a small library on top of it, called tree-hugger (https://github.com/autosoft-dev/tree-hugger). I really enjoyed their playground as well.
Retrieve years of imgur.com's data without any authentication.
Project mention: Download Almost a Decade Of Imgur Data Without Authentication | reddit.com/r/DataHoarder | 2021-09-30
Here's the repo; please don't hesitate to report a bug or help fix an issue. The new release is slower but gets more data than the older releases. Thoughts and feedback are welcome!
Feature engineering and machine learning: together at last!
Project mention: Why isn't differential dataflow more popular? | news.ycombinator.com | 2021-01-22
It will return the sum of all values in column A. For large tables it will take some time to compute the result. Now assume we append a new record and want to get the new result. The traditional approach is to execute this query again. A better approach is to process only the new record, adding its value in A to the result of the previous query. This is important in (stateful) stream processing.
Something similar is implemented in these libraries, which, however, rely on a different data processing concept (an alternative to map-reduce):
https://github.com/asavinov/prosto - Functions matter! No join-groupby, No map-reduce.
https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last!
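The incremental update described in the comment above can be sketched in plain Python. This is only an illustration of the general idea, not how prosto, lambdo, or differential dataflow implement it; the `IncrementalSum` class and the sample values are hypothetical.

```python
class IncrementalSum:
    """Keeps a running sum so each append costs O(1) instead of a full rescan."""

    def __init__(self, values=()):
        self.total = 0
        self.count = 0
        for value in values:
            self.append(value)

    def append(self, value):
        """Fold one new record into the stored result and return the new sum."""
        self.total += value
        self.count += 1
        return self.total


# Hypothetical contents of column A.
column_a = [3, 5, 7]

# Traditional approach: scan the whole column every time the result is needed.
full_rescan = sum(column_a)

# Incremental approach: keep the previous result as state and update it.
running = IncrementalSum(column_a)

# A new record arrives: only its value is processed.
column_a.append(10)
new_total = running.append(10)

# Both approaches agree, but the incremental one did not re-read old rows.
assert new_total == sum(column_a)
print(new_total)
```

The trade-off is that the aggregate becomes stateful: the previous result has to be stored and kept consistent with the data, which is the concern of stateful stream processing mentioned in the comment.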
A Telegram crawler made in Python to search groups and channels automatically and collect any type of data from them.
Project mention: A Telegram groups crawler | reddit.com/r/SideProject | 2022-01-08
Hi everyone, this is my side project in Python: https://github.com/edogab33/telegram-groups-crawler
Inspired by recent advances in coverage-guided analysis of neural networks, we propose a novel anomaly detection method. We show that the hidden activation values contain information useful to distinguish between normal and anomalous samples. Our approach combines three neural networks in a purely data-driven end-to-end model. Based on the activation values in the target network, the alarm network decides if the given sample is normal. Thanks to the anomaly network, our method even works in stri
Python Data Mining related posts
Question for using the Universal Reddit Scraper (URS)
1 project | reddit.com/r/learnpython | 19 Dec 2021
Question about Instagram scraping problem for Thesis (Too big size of data to scrape)
1 project | reddit.com/r/learnpython | 9 Nov 2021
Software for visualization core data
1 project | reddit.com/r/geologycareers | 14 Oct 2021
ETL Library for Python
1 project | reddit.com/r/Python | 27 Sep 2021
Hi r/dota2, I'm excited for TI so I made a dataset of competitive matches.
1 project | reddit.com/r/DotA2 | 22 Sep 2021
[D] Why Hasn't FOSS Drag-and-Drop ML tools taken off yet?
2 projects | reddit.com/r/MachineLearning | 8 Sep 2021
Open Source Deep Packet Inspection Using Python
1 project | news.ycombinator.com | 2 Jul 2021
What are some of the best open-source Data Mining projects in Python? This list will help you: