Top 23 Python Data Science Projects
Deep Learning for humans
Project mention: Weekly Quant Update 10.11.22 - Surviving a fundamental crisis with trading bots | reddit.com/r/u_KappaTrading | 2022-11-10
All strategies share some common traits: they all use neural net libraries. Two use TensorFlow; the other uses the Python Keras library: https://github.com/keras-team/keras
scikit-learn: machine learning in Python
Project mention: Scaling PostgresML to 1M Requests per Second | news.ycombinator.com | 2022-11-11
Of course. The paper is at https://arxiv.org/abs/1408.3060.
> Our method applies to any translation invariant and any dot-product kernel, such as the popular RBF kernels and polynomial kernels. We prove that the approximation is unbiased and has low variance. Experiments show that we achieve similar accuracy to full kernel expansions and Random Kitchen Sinks while being 100x faster and using 1000x less memory. These improvements, especially in terms of memory usage, make kernel methods more practical for applications that have large training sets and/or require real-time prediction.
Sadly Fastfood didn't quite make it into Scikit, but did land in scikit-learn-extra.
1. https://github.com/scikit-learn/scikit-learn/pull/3665. A shame, Scikit's equivalents scale very poorly.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Project mention: Has anyone here ever used the seaNMF model for short text topic modeling, and be willing to help me get started with it? | reddit.com/r/LanguageTechnology | 2022-11-24
Tokenize with NLTK, SpaCy or CoreNLP
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for accelerating ML workloads.
Project mention: Think about it for a second | reddit.com/r/mathmemes | 2022-10-19
https://ray.io (just dropping the link)
Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
Project mention: Coding K-Means Clustering using Python and NumPy | dev.to | 2022-09-22
ML From Scratch - An excellent GitHub repository containing implementations of many machine learning models and algorithms. Easy to understand and highly recommended.
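To give a flavor of what a "from scratch" implementation looks like, here is a minimal sketch of k-means (Lloyd's algorithm), the topic of the mentioned post. It is not taken from the repository: it uses plain Python tuples instead of NumPy so it runs without dependencies, and the function name `kmeans`, its parameters, and the sample points are all illustrative assumptions.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Hypothetical helper for illustration; the caller supplies the
    starting centroids so the result is deterministic.
    """
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Two well-separated groups of made-up 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
```

A NumPy version would replace the inner loops with array operations, which is roughly the step the repository's "bare bones NumPy" framing suggests.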
Streamlit: the fastest way to build data apps in Python
Project mention: Advent of Code - Day Downloader - Website | reddit.com/r/adventofcode | 2022-11-27
I made a Python streamlit web page to select and download the question and/or input of multiple days on Advent of Code.
Build and train PyTorch models and connect them to the ML lifecycle using Lightning App templates, without handling DIY infrastructure, cost management, scaling, and other headaches.
Project mention: We just release a complete open-source solution for accelerating Stable Diffusion pretraining and fine-tuning! | reddit.com/r/StableDiffusion | 2022-11-11
Our codebase for the diffusion models builds heavily on OpenAI's ADM codebase, lucidrains, Stable Diffusion, Lightning and Hugging Face. Thanks for open-sourcing!
looks like you can get it manually (albeit with a loss of interactivity) https://github.com/plotly/dash/issues/145
matplotlib: plotting with Python
Project mention: How to model the hanging chain PDE using numerical methods in Python? | reddit.com/r/learnpython | 2022-11-25
There are plenty of data visualization tools in Python, but probably the easiest to get started with is Matplotlib.
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 400 universities from 60 countries including Stanford, MIT, Harvard, and Cambridge.
Project mention: How to pre-train BERT on different objective tasks using HuggingFace | reddit.com/r/deeplearning | 2022-04-10
There may be a library in Hugging Face for pre-training a BERT model, but I suggest training a BERT model in native PyTorch to understand the details. Limu's course is recommended.
Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.
Project mention: Pandas 1.5 released | reddit.com/r/Python | 2022-09-19
!pip install is error-prone; it is better to use %pip install. IPython even warns about this: https://github.com/ipython/ipython/pull/12954/
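One common reason the shell form goes wrong is that the `pip` found on PATH can belong to a different Python than the one running the notebook kernel, so the package lands in an environment the notebook never imports from. The sketch below shows the same idea outside a notebook: build the pip command from `sys.executable` so it targets the current interpreter. The helper name `pip_command` is hypothetical, and the command is only constructed and printed, not executed.

```python
import sys


def pip_command(*args):
    """Build a pip invocation bound to the interpreter running this code."""
    # sys.executable is the Python binary behind the current process (in a
    # notebook, the kernel), so "-m pip" targets that environment rather
    # than whichever "pip" happens to come first on PATH.
    return [sys.executable, "-m", "pip", *args]


cmd = pip_command("install", "pandas")
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.check_call` would run the install; that step is left out here so the example has no side effects.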
Best Practices on Recommendation Systems
Project mention: There is framework for everything. | reddit.com/r/ProgrammerHumor | 2022-08-04
Topic Modelling for Humans
Project mention: Topic modeling --- allow multiple topics per statement | reddit.com/r/LanguageTechnology | 2022-11-22
Try LDA as implemented in gensim: https://github.com/RaRe-Technologies/gensim
An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.
Project mention: Best-Of Machine Learning with Python | news.ycombinator.com | 2022-04-28
An open-source NLP research library, built on PyTorch.
Project mention: How to solve ConfigurationError using HuggingFace Token Classifier | reddit.com/r/learnpython | 2022-10-08
No clue. So what I did was google the error. Here's what I found: https://github.com/allenai/allennlp/issues/4319
🦉 Data Version Control | Git for Data & Models | ML Experiments Management
Project mention: How do you manage results, plots, etc.? | reddit.com/r/bioinformatics | 2022-11-17
Bioinf has a lot of biologists who have transitioned into more technical/coding focused roles, so you'll find there's not a lot of engineering workflow standards out there compared to DS or SWE. As others have said, snakemake is the most common, but that's just a pipeline management tool; it doesn't manage data or outputs. I personally use DVC for data and pipeline management (and include jupyter and papermill to make it all work), although I haven't yet gotten onboard with their experiments feature (which is what would manage different parameters and figures/results beyond versioning). I looked into MLflow and some other options when I was getting started (I do tool development and bioinf analysis), but I wanted data versioning to ensure experiment reproducibility (kind of a critical part of science IMO), and many of the other solutions like Airflow (common in DS industry) seemed to be overkill for smaller bioinfo projects. DVC meets the requirements and I like it in concept, although in practice there have been many updates that have been a bit of a pain to keep up with/integrate. I've got a bioinfo/ds project template on github that rolls together git, conda, DVC, jupyter and papermill to ensure experiment reproducibility, and is set up as a template that can be deployed with cookiecutter - check it out if you like.
The easiest way to build, run, and monitor data pipelines at scale.
Project mention: Example typescript project repos? | reddit.com/r/typescript | 2022-10-27
If I was answering this question but for python, I'd recommend something like prefect, boto3, or tortoise-orm -- not extremely complex and with a pretty comprehensible featureset.
Statistical data visualization in Python
Project mention: Ever wondered why banking sites suck? | reddit.com/r/ProgrammerHumor | 2022-11-11
As a practical example let's look up a repository of an open source project, these are the stats of the first and second contributor as ranked by github:
Create HTML profiling reports from pandas DataFrame objects
Project mention: Data analysts: what’re some initial steps you take to get familiar w datasets? | reddit.com/r/Python | 2022-11-09
Since you already mention pandas, I can suggest that you profile the dataframe to get a better understanding of the data with respect to, e.g., distributions, data types, missing data and so forth. There exist handy tools for that, like pandas-profiling.
Deep learning library featuring a higher-level API for TensorFlow.
Project mention: Beginner Friendly Resources to Master Artificial Intelligence and Machine Learning with Python (2022) | dev.to | 2022-08-14
TFLearn – Deep learning library featuring a higher-level API for TensorFlow
Data-centric declarative deep learning framework
Python Data Science related posts
[P] Metric learning: theory, practice, code examples
2 projects | reddit.com/r/MachineLearning | 26 Nov 2022
autogluon: NEW Data - star count:5070.0
1 project | reddit.com/r/algoprojects | 25 Nov 2022
Streamlit + DuckDB Tutorial
2 projects | dev.to | 25 Nov 2022
autogluon: NEW Data - star count:5070.0
1 project | reddit.com/r/algoprojects | 24 Nov 2022
How do I solve this “UnicodeEncodeError”?
1 project | reddit.com/r/learnpython | 23 Nov 2022
autogluon: NEW Data - star count:5070.0
1 project | reddit.com/r/algoprojects | 23 Nov 2022
Ask HN: What are the best tutorial sites for Python?
3 projects | news.ycombinator.com | 23 Nov 2022
What are some of the best open-source Data Science projects in Python? This list will help you: