Top 23 Python Data Science Projects
-
Keras
Deep Learning for humans
Project mention: [D] Batch normalization before or after activation function | reddit.com/r/MachineLearning | 2021-02-23
-
scikit-learn
scikit-learn: machine learning in Python
The model in question is trained using Scikit-Learn, a Python Machine Learning library. The audio data is loaded into numpy arrays, then split into training and testing data, the model is trained using the training data, then tested with the testing data to give an idea on the accuracy.
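The workflow described above (load data into NumPy arrays, split it, train, then test) might look roughly like the sketch below. The synthetic features, the made-up labels, and the choice of RandomForestClassifier are assumptions for illustration only; the original mention does not say which features or estimator were used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for audio features (hypothetical: 200 clips, 20 features each).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # made-up binary labels for illustration

# Hold out 25% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train on the training split only (the estimator choice is an assumption).
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Score on the held-out split to estimate accuracy on unseen data.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

With real audio, X would hold features extracted from each clip rather than random numbers.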
-
superset
Apache Superset is a Data Visualization and Data Exploration Platform
Project mention: Publishing dashboards for clients (advice and suggestions plz) | reddit.com/r/BusinessIntelligence | 2021-02-23
Many people use Apache Superset this way, in the 'embedded' way: superset.apache.org. Since it's open source, you can customize it extensively.
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Project mention: Resources for learning Python from scratch specifically for data ingestion | reddit.com/r/learnpython | 2021-02-13
data science ipython notebooks
-
spaCy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Project mention: Ask HN: What is your production ML stack like? (2021) | news.ycombinator.com | 2021-02-08
Here's the ML stack I have been using for my last project:
- Doing NLP with spaCy (https://spacy.io/) as I consider it to be the most production ready framework for NLP
- Annotating datasets with Prodigy (https://prodi.gy/), a paid tool made by the spaCy team
- Deploying the trained spaCy models onto NLP Cloud (https://nlpcloud.io)
- Use the models through the NLP Cloud API in production and enrich my Django application out of it
-
Ray
An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.
Project mention: How to get my multi-agents more collaborative? | reddit.com/r/reinforcementlearning | 2021-02-15
QMIX is indeed a great paper. I'm planning on using it with RLLIB on my env, however it asks some work to adapt and understand the subtleties ;) (such as the agents groups: https://github.com/ray-project/ray/blob/936cb5929c455102d5638ff5d59c80c4ae94770f/rllib/env/multi_agent_env.py#L82)
-
ipython
Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.
I've duplicated your error, and it appears to only happen with .wav files. It seems to be a Firefox issue.
-
dash
Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.
If you want a web based dashboard then dash is the way to go
-
streamlit
Streamlit — The fastest way to build data apps in Python
Project mention: Which GUI framework do you/would you use for which purposes and why? | reddit.com/r/Python | 2021-02-13
streamlit (Oriented Data science)
-
pytorch-lightning
The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.
Project mention: DDP with model parallelism with multi host multi GPU system | reddit.com/r/pytorch | 2021-02-07
-
gensim
Topic Modelling for Humans
Project mention: Koan: A word2vec negative sampling implementation with correct CBOW update | news.ycombinator.com | 2021-01-02
Apparently it did: https://github.com/RaRe-Technologies/gensim/issues/1873
-
allennlp
An open-source NLP research library, built on PyTorch.
-
TFLearn
Deep learning library featuring a higher-level API for TensorFlow.
-
nni
An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
Project mention: How we were able to achieve hyper-parameter tuning (HPT) for deep learning workflows at 1.5x faster in our clusters and 3x cheaper on AWS | reddit.com/r/learnmachinelearning | 2021-02-23
To tackle the problem of long and expensive HPT workflows, our team at Petuum collaborated with Microsoft to integrate AdaptDL with Neural Network Intelligence (NNI). AdaptDL is an open-source tool in the CASL (Composable, Automatic, and Scalable Learning) ecosystem. AdaptDL offers adaptive resource management for distributed clusters, and reduces the cost of deep learning workloads ranging from a few training/tuning trials to thousands. NNI, from the Microsoft open-source community, is a toolkit for automatic machine learning (AutoML) and hyper-parameter tuning.
-
seaborn
Statistical data visualization using matplotlib
-
dvc
🦉Data Version Control | Git for Data & Models
Project mention: SnowFS – a fast, scalable version control file storage for graphic files | news.ycombinator.com | 2021-02-20
Very interesting. I'd like to learn more about how it works. How does this compare to DVC[1], for instance?
I'll throw in a shameless plug for my tool in this area, Dud[2]. Dud is to DVC what Flask is to Django.
Are the mentioned benchmarks published somewhere?
[1]: https://dvc.org
-
Prefect
The easiest way to automate your data
Project mention: [D] Software stack to replicate Azure ML / Google Auto ML on premise | reddit.com/r/MachineLearning | 2021-02-03
Update: So far I started using Prefect (http://prefect.io). With this I can work on my local computer, submit code to Azure Blob Storage and the Prefect server, after which an agent (worker) runs the code. Logging/Metrics are not implemented yet; I might use MLFlow for this (http://mlflow.org). Furthermore, there is still a dependency on a cloud solution to store your Flows (programs) to run them on agents.
-
boltons
🔩 Like builtins, but boltons. 250+ constructs, recipes, and snippets which extend (and rely on nothing but) the Python standard library. Nothing like Michael Bolton.
-
cookiecutter-data-science
A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
Take a look at https://github.com/drivendata/cookiecutter-data-science for a well structured project layout and then make 1 script for each step (1-2-3), so that you can reproduce/modify it easily.
-
pyod
(JMLR'19) A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)
Project mention: PyOD: ~50 anomaly detection algorithms in one framework. | reddit.com/r/algotrading | 2021-01-25
-
best-of-ml-python
🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.
Project mention: best-of-python: A ranked list of awesome Python libraries and tools | reddit.com/r/Python | 2021-01-14
Here ya go: https://github.com/ml-tooling/best-of-ml-python/pull/47
-
metaflow
Build and manage real-life data science projects with ease.
Project mention: Netflix's Metaflow: Reproducible machine learning pipelines | news.ycombinator.com | 2020-12-21
Has anyone done a comparison of ML pipelines from a devops-centric perspective?
For example, Metaflow doesn't support Kubernetes today - https://github.com/Netflix/metaflow/issues/16
So ultimately the scale-up story in most of these management tools is iffy.
I previously asked about Kubeflow here - https://news.ycombinator.com/item?id=24808090 . Seems people think it's pretty "horrendous". It seems most of these tools assume a very specialised devops team who will work around the ML tool, rather than the ML tool making this easy.
-
great_expectations
Always know what to expect from your data.
Project mention: For those using Airflow for your ELT/Orchestration, How are you performing your EL? | reddit.com/r/dataengineering | 2021-01-30
(T): https://github.com/fishtown-analytics/dbt + https://github.com/great-expectations/great_expectations + https://github.com/dagster-io/dagster
Index
What are some of the best open-source Data Science projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Keras | 50,757 |
2 | scikit-learn | 44,626 |
3 | superset | 35,438 |
4 | data-science-ipython-notebooks | 20,249 |
5 | spaCy | 19,619 |
6 | Ray | 14,865 |
7 | ipython | 14,677 |
8 | dash | 13,974 |
9 | streamlit | 13,389 |
10 | pytorch-lightning | 12,092 |
11 | gensim | 11,750 |
12 | allennlp | 9,712 |
13 | TFLearn | 9,522 |
14 | nni | 9,102 |
15 | seaborn | 8,124 |
16 | dvc | 7,354 |
17 | Prefect | 5,880 |
18 | boltons | 5,382 |
19 | cookiecutter-data-science | 4,235 |
20 | pyod | 4,174 |
21 | best-of-ml-python | 4,148 |
22 | metaflow | 4,076 |
23 | great_expectations | 3,678 |