Data Science

Open-source projects categorized as Data Science

Top 23 Data Science Open-Source Projects

  • GitHub repo Keras

    Deep Learning for humans

    Project mention: That time I optimized a Python program by 5000x | 2022-01-11

    The report output for scalene does look much nicer, but the slowness stopped me from continuing to use it. Maybe there's some bad interaction with TensorFlow/pytest. I can try to make an example, but I'd guess if you try running it on TensorFlow's actual unit tests (something like this) you'd get similar behavior.

  • GitHub repo scikit-learn

    scikit-learn: machine learning in Python

    Project mention: scikit-learn test case results? | 2022-01-05

  • GitHub repo superset

    Apache Superset is a Data Visualization and Data Exploration Platform

    Project mention: Churn Prediction With BigQueryML to Increase Mobile Game Revenue | 2022-01-10

    You may recall a blog post in 2020 where we discussed how Torpedo was leveraging the RudderStack Unity SDK to send over 1 billion events per month to Redshift to power a robust analytics dashboard built with Superset. This cost-effective solution shed light on which player activities were important for driving engagement, helped refine the definition of high-value player cohorts, and helped identify player churn. After crossing the 100 million download milestone, the teams at Wynn Resorts and Torpedo were looking for innovative ways to continue improving the gaming experience and user engagement and, most importantly, to reduce churn among those high-value customers making in-app purchases.

  • GitHub repo MadeWithML

    Learn how to responsibly deliver value with ML.

    Project mention: New to mlops, where do I need to start | 2021-11-01

    Standing recommendation for beginners (we should eventually make a wiki) is

  • GitHub repo ML-For-Beginners

    12 weeks, 26 lessons, 52 quizzes, classic Machine Learning for all

    Project mention: Top Github repo trends in 2021 | 2022-01-12

    Three educational courses: Web Dev, ML, and IoT for beginners. A note on using educational resources as a marketing strategy: at least the ML course links to various Azure services. Google does this a lot as well, with Colab notebooks often being used to demo educational materials.

  • GitHub repo Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

    aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

    Project mention: A/B test improved your website's conversion rate? Not so fast | 2022-01-07

    Bayesian Methods For Hackers is a very popular one.


  • GitHub repo data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • GitHub repo spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Launch HN: Nyckel (YC W22) – Train and deploy ML classifiers in minutes | 2022-01-10

    Check out spaCy; it provides part-of-speech (PoS) tagging, among other things:

  • GitHub repo ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

    Project mention: Neural Network from Scratch | 2022-01-04

    Interesting find. Just FYI, this repo has been the OG for several years, when it comes to building NN from scratch:

  • GitHub repo Ray

    An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

    Project mention: Is it normal to have a negative and near-zero explained variance in PPO? | 2021-12-25

    I guess I did, as I directly use the PPO agent provided by the RLlib.

  • GitHub repo applied-ml

    📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.

    Project mention: Top Github repo trends in 2021 | 2022-01-12

    The second repo I LOVE is Eugene Yan’s Applied ML repository. This is a brilliant idea to create and actually something I was planning on sort of casually doing in my non-existent free time… Anyhow, it is a curated list of technical posts from top engineering teams (Netflix, Amazon, Pinterest, LinkedIn, etc.) detailing how they built out different types of AI/ML systems (e.g. forecasting, recommenders, search and ranking). Ofc, it focuses on AI/ML, but something similar could be made for the traditional or BI-oriented analytics stack, as well as the streaming world; super high value for practitioners! Btw, one of my favorite things at BCG used to be looking at our IT architecture team’s reference architecture diagrams… the best way to understand technologies is to look at how a ton of stuff is architected… and it's fun!

  • GitHub repo awesome-datascience

    :memo: An awesome Data Science repository to learn and apply for real world problems.

    Project mention: High income skills? | 2021-12-22

    There are several on GitHub, such as:

  • GitHub repo streamlit

    Streamlit — The fastest way to build data apps in Python

    Project mention: How to Build a Machine Learning Demo in 2022 | 2022-01-16

    So what if you want something almost as flexible as what is possible with the full-stack approach, but without the development requirements? Well, you are in luck because the past few years have seen the emergence of Python libraries that allow the creation of impressively interactive demos with only a few lines of code. In this article, we are going to focus on two of the most promising libraries: Gradio and Streamlit. There are notable differences between the two that will be explored below, but the high level idea is the same: eliminate most of the painful back and front end work outlined in the full-stack section, albeit at the cost of some flexibility.

  • GitHub repo pytorch-lightning

    The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.

    Project mention: [D] Are you using PyTorch or TensorFlow going into 2022? | 2021-12-14

    Is the problem the sheer number of options, or the fact that they are all together in one place? Would it be better if they were organized into the different trainer entrypoints (fit, validate, ...)? If that is the case, there was an RFC proposing this which you might find interesting, feel free to drop by and comment on the issue:

  • GitHub repo dash

    Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.

    Project mention: Advice on what languages / frameworks to use to build website | 2021-12-31

    You might be able to use Dash for your needs.

  • GitHub repo AI-Expert-Roadmap

    Roadmap to becoming an Artificial Intelligence Expert in 2021

    Project mention: Top Github repo trends in 2021 | 2022-01-12

    The AI Expert Roadmap (interactive web page) seems to have taken inspiration from the developer roadmap linked above and is awesome. I LOVE how they separate out different personas, from data scientist, to machine learning, to deep learning, to data engineering, etc. It’s really well done and fun to browse through! It is also kind of fun to juxtapose this with the aforementioned Developer Roadmap, as well as the Analytics Engineers Club, as they collectively cover so much of modern tech in slightly MECE² (#BCG) ways 😃

  • GitHub repo ipython

    Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.

    Project mention: New IPython defaults makes it less useful for education purposes. [Raymond Hettinger on Twitter] | 2022-01-15

    That's not correct. In the feedback he got (GitHub, Twitter), most people (of the few that replied) came down on "opt-in" and against making it the default.

  • GitHub repo fastbook

    The fastai book, published as Jupyter Notebooks

    Project mention: Starting a career as a Python developer | 2021-12-20

    I’m a fan of fastbook by fastai.

  • GitHub repo gensim

    Topic Modelling for Humans

    Project mention: Unsupervised Learning for String Matching in Python - can I have advice on how to go about this? | 2021-12-16
  • GitHub repo stanford-cs-229-machine-learning

    VIP cheatsheets for Stanford's CS 229 Machine Learning

    Project mention: Stanford University Probabilities and Statistics refresher | 2021-03-24
  • GitHub repo Awesome-pytorch-list

    A comprehensive list of PyTorch-related content on GitHub, such as different models, implementations, helper libraries, tutorials, etc.

    Project mention: Similar open source long library list to TF like Pytorch "ECOSYSTEM TOOLS" | 2021-11-19

    I got the following as a recommendation from elsewhere, and there is one for PyTorch as well. Thx for the help :D

  • GitHub repo d2l-en

    Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 300 universities from 55 countries including Stanford, MIT, Harvard, and Cambridge.

    Project mention: The Transformer in Machine Translation | 2022-01-13

    GitHub's article on Dive into Deep Learning

  • GitHub repo recommenders

    Best Practices on Recommendation Systems

    Project mention: Opinion on choice of model - Recommender System | 2021-04-10

    Then I tried to find some more advanced models and I found this really good list, and in there I found the Microsoft one. So that's where we are now, with a bunch of different models and not much documentation or many tutorials out there.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-16.

What are some of the best open-source Data Science projects? This list will help you:

# Project Stars
1 Keras 53,662
2 scikit-learn 48,549
3 superset 43,792
4 MadeWithML 29,372
5 ML-For-Beginners 28,595
6 Probabilistic-Programming-and-Bayesian-Methods-for-Hackers 23,996
7 data-science-ipython-notebooks 22,327
8 spaCy 22,176
9 ML-From-Scratch 20,725
10 Ray 18,807
11 applied-ml 17,951
12 awesome-datascience 17,738
13 streamlit 17,351
14 pytorch-lightning 16,905
15 dash 15,747
16 AI-Expert-Roadmap 15,661
17 ipython 15,167
18 fastbook 14,148
19 gensim 12,815
20 stanford-cs-229-machine-learning 12,643
21 Awesome-pytorch-list 12,547
22 d2l-en 12,134
23 recommenders 12,028