Python Data Science

Open-source Python projects categorized as Data Science

Top 23 Python Data Science Projects

  • GitHub repo Keras

    Deep Learning for humans

    Project mention: [Project] I'm trying to implement StyleGAN2 in Keras to better understand its structure and just AAAAAAAA | reddit.com/r/learnmachinelearning | 2021-05-17
  • GitHub repo scikit-learn

    scikit-learn: machine learning in Python

    Project mention: Is there a way to map cluster centers back to a dataframe? | reddit.com/r/learnpython | 2021-05-19

    To avoid the issue with convergence (and the discrepancy between the labels_ and cluster_centers_), you can set tol=0, though this can of course lead to issues if convergence is a problem. There was an issue about it here. Assuming it's converged, then the order is fine.
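
    The mapping discussed in this mention can be sketched as follows. This is a minimal illustration, not the original poster's code: the toy DataFrame, the column names (`x`, `y`, `cluster`, `center_x`, `center_y`) and the choice of two clusters are all assumptions made for the example. It relies only on the documented `KMeans` attributes `labels_` and `cluster_centers_`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy data (assumed for illustration): two well-separated groups of points.
df = pd.DataFrame({
    "x": [1.0, 1.2, 0.8, 8.0, 8.2, 7.8],
    "y": [1.0, 0.9, 1.1, 8.0, 8.1, 7.9],
})

# tol=0 follows the suggestion in the comment above; n_init and random_state
# are set explicitly only to make the example reproducible.
km = KMeans(n_clusters=2, n_init=10, tol=0, random_state=0).fit(df[["x", "y"]])

# labels_[i] is the row index into cluster_centers_ for sample i, so fancy
# indexing yields one center per DataFrame row, in the original row order.
df["cluster"] = km.labels_
centers = km.cluster_centers_[km.labels_]
df["center_x"] = centers[:, 0]
df["center_y"] = centers[:, 1]

print(df)
```

    Each row now carries both its cluster label and the coordinates of that cluster's center, which is usually what "mapping centers back to a dataframe" amounts to once the fit has converged.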

  • GitHub repo superset

    Apache Superset is a Data Visualization and Data Exploration Platform

    Project mention: Jupyter notebooks for dashboarding? | reddit.com/r/BusinessIntelligence | 2021-06-13

    Give Apache Superset a try.

  • GitHub repo data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

    Project mention: Beginner in Python for Data Science | reddit.com/r/learnpython | 2020-12-27

    data science ipython notebooks

  • GitHub repo spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Resume Advice Thread - June 08, 2021 | reddit.com/r/cscareerquestions | 2021-06-08

    "metadata" is "meta-data", "Spacy" is formally "spaCy", "Node" is formally "Node.js", "Mongo" is formally "MongoDB", "Websockets" is (possibly) "WebSocket", "twitter" is formally "Twitter", and "Javascript" is formally "JavaScript".

  • GitHub repo Ray

    An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

    Project mention: Ray 1.4.0 | news.ycombinator.com | 2021-06-08
  • GitHub repo ipython

    Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.

    Project mention: A resource for looking 'under the hood' of multi-processing in python | reddit.com/r/learnpython | 2021-04-19

    I stand corrected... Apparently, it does matter, if you are on Windows... https://github.com/ipython/ipython/issues/4698#issuecomment-30605308

  • GitHub repo streamlit

    Streamlit — The fastest way to build data apps in Python

    Project mention: Jupyter notebooks for dashboarding? | reddit.com/r/BusinessIntelligence | 2021-06-13
  • GitHub repo dash

    Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.

    Project mention: What tools are available for personal use? | reddit.com/r/datascience | 2021-06-07

    Plotly-Dash

  • GitHub repo pytorch-lightning

    The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.

    Project mention: [P] An introduction to PyKale https://github.com/pykale/pykale, a PyTorch library that provides a unified pipeline-based API for knowledge-aware multimodal learning and transfer learning on graphs, images, texts, and videos to accelerate interdisciplinary research. Welcome feedback/contribution! | reddit.com/r/MachineLearning | 2021-04-25

    If you want a good example for reference, take a look at PyTorch Lightning's README (https://github.com/PyTorchLightning/pytorch-lightning). It answers the three questions of "what is this", "why should I care", and "how do I use it" almost instantly.

  • GitHub repo gensim

    Topic Modelling for Humans

    Project mention: The Levenshtein Distance in Production | news.ycombinator.com | 2021-06-06

    > Problem statement: the Levenshtein distance is a string metric for measuring the difference between two sequences

    Another variant is "I have a bunch of words (a dictionary) and one query word, and want to find all words from the dictionary that are close to the query word".

    This leads to an interesting class of problems, because you can do clever things where you precompute search structures (Levenshtein automata [0]) from the dictionary. The similarity queries then run (much) faster – in production, performance matters.

    We recently merged a PR like that into Gensim [1].

    This gave a ~1,500x speed-up compared to naively comparing all pairwise strings with Levenshtein distance. A difference between the training step running for years (=unusable) and minutes.

    [0] http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levensht...

    [1] https://github.com/RaRe-Technologies/gensim/pull/3146
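
    For context, the naive baseline that the comment above compares against can be sketched in a few lines. This is an illustrative pure-Python version, not Gensim's implementation and not a Levenshtein automaton; the function names `levenshtein` and `close_words` and the sample word list are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def close_words(query, dictionary, max_distance=1):
    """Naive scan: compare the query against every word in the dictionary."""
    return [w for w in dictionary if levenshtein(query, w) <= max_distance]


words = ["cat", "cart", "dog", "bat", "cats"]  # sample dictionary (assumed)
print(levenshtein("kitten", "sitting"))
print(close_words("cat", words))
```

    Every query here costs one full distance computation per dictionary word, which is the linear scan that precomputed search structures are meant to avoid.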

  • GitHub repo recommenders

    Best Practices on Recommendation Systems

    Project mention: Opinion on choice of model - Recommender System | reddit.com/r/datascience | 2021-04-10

    Then I tried to find some more advanced models, and I found this really good list, and in there I found the Microsoft one. So that's where we are now, with a bunch of different models and not much documentation or many tutorials out there.

  • GitHub repo allennlp

    An open-source NLP research library, built on PyTorch.

    Project mention: C4 dataset released (800GB Common Crawl-derived text; T5 training data) | reddit.com/r/mlscaling | 2021-03-16
  • GitHub repo d2l-en

    Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 175 universities.

    Project mention: I created a way to learn machine learning through Jupyter | reddit.com/r/learnmachinelearning | 2021-04-30

    There are actually some online books and courses built on Jupyter Notebook ([Dive into Deep Learning](https://github.com/d2l-ai/d2l-en), for example). However, yours is more detailed and could really help beginners.

  • GitHub repo nni

    An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

    Project mention: [D] Efficient ways of choosing number of layers/neurons in a neural network | reddit.com/r/statistics | 2021-04-20

    optuna, hyperopt, nni, plenty of less-known tools too.

  • GitHub repo TFLearn

    Deep learning library featuring a higher-level API for TensorFlow.

    Project mention: Base ball | dev.to | 2021-03-20

    Both teams in a game are given their individual ID values and are made into vectors. Relevant data like the home and away team, home runs, RBIs, and walks are all taken into account and passed through layers. There's no need to reinvent the wheel here: there is a multitude of libraries that enable a coder to implement machine learning theories efficiently. In this case we will be using a library called TFLearn, with documentation available from http://tflearn.org. The program will output the home and away teams as well as their respective score predictions.

  • GitHub repo seaborn

    Statistical data visualization using matplotlib

    Project mention: [OC] Visualizing the impact of dice choice on outcome | reddit.com/r/DnD | 2021-05-30

    https://seaborn.pydata.org/ is a plotting library, a bit more user-friendly and prettier out of the box than raw matplotlib. sns is just an alias (import seaborn as sns).

  • GitHub repo dvc

    🦉Data Version Control | Git for Data & Models | ML Experiments Management

    Project mention: [Project] DVC Studio – Git-Based ML Experiments Management | reddit.com/r/MachineLearning | 2021-06-02

    Hey everyone, our team is working on open-source tools for data scientists: https://dvc.org and https://cml.dev. These two products help ML teams track ML experiments and run training in the cloud using a Git and GitOps approach.

  • GitHub repo Prefect

    The easiest way to automate your data

    Project mention: Hi, how can I do pipeline automation? | reddit.com/r/learnpython | 2021-04-18

    If you are just starting out or new to automation, I would look at plain Python scripts executed with cron on Linux/Mac or with Windows Task Scheduler on Windows. You'll need some bash knowledge (Linux/Mac) or DOS/batch knowledge (Windows). Then graduate to using frameworks. Since you didn't specify what types of jobs you want to automate, for general-purpose needs I would look at a class of frameworks called task orchestration frameworks or workflow management libraries. I would highly recommend dagster, as it comes with a native scheduler, so you would be free from having to use cron or Windows Task Scheduler. Other options include prefect, but if you want its other features, like its scheduler and web GUI, you'll have to mess with Docker. That's what's nice about dagster: it all works out of the box without the need for non-Python dependencies.

  • GitHub repo boltons

    🔩 Like builtins, but boltons. 250+ constructs, recipes, and snippets which extend (and rely on nothing but) the Python standard library. Nothing like Michael Bolton.

  • GitHub repo best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

    Project mention: Are there any speech recognition modules so I can write one and do not have to rely on google and the likes? | reddit.com/r/learnmachinelearning | 2021-04-18
  • GitHub repo knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

    Project mention: How does everyone share their models etc. across teams for re-use effectively? | reddit.com/r/datascience | 2021-05-22
  • GitHub repo cookiecutter-data-science

    A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.

    Project mention: What GitHub template do you guys follow? | reddit.com/r/datascience | 2021-05-01

    I have to set up a GitHub repo for an upcoming project and was researching some data science templates to follow. I came across cookie cutter and this template by drivendata: https://github.com/drivendata/cookiecutter-data-science

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-06-13.

Index

What are some of the best open-source Data Science projects in Python? This list will help you:

Rank Project Stars
1 Keras 51,308
2 scikit-learn 46,078
3 superset 39,046
4 data-science-ipython-notebooks 21,197
5 spaCy 20,639
6 Ray 16,206
7 ipython 14,833
8 streamlit 14,832
9 dash 14,679
10 pytorch-lightning 13,792
11 gensim 12,156
12 recommenders 10,304
13 allennlp 10,093
14 d2l-en 10,071
15 nni 9,769
16 TFLearn 9,549
17 seaborn 8,487
18 dvc 8,103
19 Prefect 6,394
20 boltons 5,480
21 best-of-ml-python 5,300
22 knowledge-repo 4,792
23 cookiecutter-data-science 4,707