Python Data Science

Open-source Python projects categorized as Data Science

Top 23 Python Data Science Projects

  • GitHub repo Keras

    Deep Learning for humans

    Project mention: [D] Batch normalization before or after activation function | reddit.com/r/MachineLearning | 2021-02-23
  • GitHub repo scikit-learn

    scikit-learn: machine learning in Python

    Project mention: Using TinyML to identify farts | dev.to | 2021-02-22

    The model in question is trained using scikit-learn, a Python machine learning library. The audio data is loaded into NumPy arrays and split into training and testing sets; the model is trained on the training set, then evaluated on the testing set to estimate its accuracy.
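
The workflow described in that mention (arrays in, train/test split, fit, score) can be sketched roughly as follows. This is a minimal illustration, not the code from the linked post: the synthetic feature arrays stand in for real audio data, and the choice of `LogisticRegression`, the feature count, and the 75/25 split are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for audio features: 200 clips, 16 features per clip.
# Real code would load and featurize audio files instead.
rng = np.random.default_rng(0)
n_per_class = 100
background = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 16))
target_sound = rng.normal(loc=3.0, scale=1.0, size=(n_per_class, 16))

X = np.vstack([background, target_sound])            # feature matrix
y = np.array([0] * n_per_class + [1] * n_per_class)  # class labels

# Hold out 25% of the clips for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training split only (the classifier choice is an assumption).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score on the held-out split to estimate accuracy on unseen data.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Evaluating on clips the model never saw during training is what makes the accuracy figure a usable estimate; scoring on the training data would overstate it.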

  • GitHub repo superset

    Apache Superset is a Data Visualization and Data Exploration Platform

    Project mention: Publishing dashboards for clients (advice and suggestions plz) | reddit.com/r/BusinessIntelligence | 2021-02-23

    Many people use Apache Superset this way, in the 'embedded' way: superset.apache.org. Since it's open source, you can customize it extensively.

  • GitHub repo data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

    Project mention: Resources for learning Python from scratch specifically for data ingestion | reddit.com/r/learnpython | 2021-02-13

    data science ipython notebooks

  • GitHub repo spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Ask HN: What is your production ML stack like? (2021) | news.ycombinator.com | 2021-02-08

    Here's the ML stack I have been using for my last project:

    - Doing NLP with spaCy (https://spacy.io/) as I consider it to be the most production ready framework for NLP

    - Annotating datasets with Prodigy (https://prodi.gy/), a paid tool made by the spaCy team

    - Deploying the trained spaCy models onto NLP Cloud (https://nlpcloud.io)

    - Use the models through the NLP Cloud API in production and enrich my Django application out of it

  • GitHub repo Ray

    An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

    Project mention: How to get my multi-agents more collaborative? | reddit.com/r/reinforcementlearning | 2021-02-15

    QMIX is indeed a great paper. I'm planning on using it with RLlib on my env; however, it takes some work to adapt and to understand the subtleties ;) (such as the agent groups: https://github.com/ray-project/ray/blob/936cb5929c455102d5638ff5d59c80c4ae94770f/rllib/env/multi_agent_env.py#L82 )

  • GitHub repo ipython

    Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.

    Project mention: Question About Embedding Html Audio Tags In | reddit.com/r/IPython | 2021-02-17

    I've duplicated your error, and it appears to only happen with .wav files. It seems to be a Firefox issue.

  • GitHub repo dash

    Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.

    Project mention: Best python GUI to learn? | reddit.com/r/learnpython | 2021-02-23

    If you want a web based dashboard then dash is the way to go

  • GitHub repo streamlit

    Streamlit — The fastest way to build data apps in Python

    Project mention: Which GUI framework do you/would you use for which purposes and why? | reddit.com/r/Python | 2021-02-13

    streamlit (Oriented Data science)

  • GitHub repo pytorch-lightning

    The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.

    Project mention: DDP with model parallelism with multi host multi GPU system | reddit.com/r/pytorch | 2021-02-07
  • GitHub repo gensim

    Topic Modelling for Humans

    Project mention: Koan: A word2vec negative sampling implementation with correct CBOW update | news.ycombinator.com | 2021-01-02

    Apparently it did: https://github.com/RaRe-Technologies/gensim/issues/1873

  • GitHub repo allennlp

    An open-source NLP research library, built on PyTorch.

    Project mention: AllenNLP v2.0.0 | news.ycombinator.com | 2021-01-27
  • GitHub repo TFLearn

    Deep learning library featuring a higher-level API for TensorFlow.

  • GitHub repo nni

    An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

    Project mention: How we were able to achieve hyper-parameter tuning (HPT) for deep learning workflows at 1.5x faster in our clusters and 3x cheaper on AWS | reddit.com/r/learnmachinelearning | 2021-02-23

    To tackle the problem of long and expensive HPT workflows, our team at Petuum collaborated with Microsoft to integrate AdaptDL with Neural Network Intelligence (NNI). AdaptDL is an open-source tool in the CASL (Composable, Automatic, and Scalable Learning) ecosystem. It offers adaptive resource management for distributed clusters and reduces the cost of deep learning workloads ranging from a few training/tuning trials to thousands. NNI, from the Microsoft open-source community, is a toolkit for automatic machine learning (AutoML) and hyper-parameter tuning.

  • GitHub repo seaborn

    Statistical data visualization using matplotlib

  • GitHub repo dvc

    🦉Data Version Control | Git for Data & Models

    Project mention: SnowFS – a fast, scalable version control file storage for graphic files | news.ycombinator.com | 2021-02-20

    Very interesting. I'd like to learn more about how it works. How does this compare to DVC[1], for instance?

    I'll throw in a shameless plug for my tool in this area, Dud[2]. Dud is to DVC what Flask is to Django.

    Are the mentioned benchmarks published somewhere?

    [1]: https://dvc.org

  • GitHub repo Prefect

    The easiest way to automate your data

    Project mention: [D] Software stack to replicate Azure ML / Google Auto ML on premise | reddit.com/r/MachineLearning | 2021-02-03

    Update: So far I have started using Prefect (http://prefect.io). With it I can work on my local computer and submit code to Azure Blob Storage and the Prefect server, after which an agent (worker) runs the code. Logging/metrics are not implemented yet; I might use MLflow for this (http://mlflow.org). Furthermore, there is still a dependency on a cloud solution to store your Flows (programs) so that agents can run them.

  • GitHub repo boltons

    🔩 Like builtins, but boltons. 250+ constructs, recipes, and snippets which extend (and rely on nothing but) the Python standard library. Nothing like Michael Bolton.

  • GitHub repo cookiecutter-data-science

    A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.

    Project mention: How do you handle raw + clean data? | reddit.com/r/dataengineering | 2021-02-15

    Take a look at https://github.com/drivendata/cookiecutter-data-science for a well structured project layout and then make 1 script for each step (1-2-3), so that you can reproduce/modify it easily.

  • GitHub repo pyod

    (JMLR'19) A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)

    Project mention: PyOD: ~50 anomaly detection algorithms in one framework. | reddit.com/r/algotrading | 2021-01-25
  • GitHub repo best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

    Project mention: best-of-python: A ranked list of awesome Python libraries and tools | reddit.com/r/Python | 2021-01-14

    Here ya go: https://github.com/ml-tooling/best-of-ml-python/pull/47

  • GitHub repo metaflow

    Build and manage real-life data science projects with ease.

    Project mention: Netflix's Metaflow: Reproducible machine learning pipelines | news.ycombinator.com | 2020-12-21

    Has anyone done a comparison of ML pipelines from a DevOps-centric perspective?

    For example, Metaflow doesn't support Kubernetes today - https://github.com/Netflix/metaflow/issues/16

    So ultimately the scale-up story in most of these management tools is iffy.

    I previously asked about Kubeflow here - https://news.ycombinator.com/item?id=24808090 - and it seems people think it's pretty "horrendous". It seems most of these tools assume a very specialised DevOps team who will work around the ML tool, rather than the ML tool making this easy.

  • GitHub repo great_expectations

    Always know what to expect from your data.

    Project mention: For those using Airflow for your ELT/Orchestration, How are you perfroming your EL? | reddit.com/r/dataengineering | 2021-01-30

    (T) : https://github.com/fishtown-analytics/dbt + https://github.com/great-expectations/great_expectations + https://github.com/dagster-io/dagster

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-02-23.

Index

What are some of the best open-source Data Science projects in Python? This list will help you:

#   Project                          Stars
1   Keras                            50,757
2   scikit-learn                     44,626
3   superset                         35,438
4   data-science-ipython-notebooks   20,249
5   spaCy                            19,619
6   Ray                              14,865
7   ipython                          14,677
8   dash                             13,974
9   streamlit                        13,389
10  pytorch-lightning                12,092
11  gensim                           11,750
12  allennlp                          9,712
13  TFLearn                           9,522
14  nni                               9,102
15  seaborn                           8,124
16  dvc                               7,354
17  Prefect                           5,880
18  boltons                           5,382
19  cookiecutter-data-science         4,235
20  pyod                              4,174
21  best-of-ml-python                 4,148
22  metaflow                          4,076
23  great_expectations                3,678