Data Science

Top 23 Data Science Open-Source Projects

  • Keras

    Deep Learning for humans

    Project mention: Can someone explain how keras code gets into the Tensorflow package? | /r/tensorflow | 2023-07-24

    I'm guessing the "real" keras code is coming from the keras repository. Is that a correct assumption? How does that version of Keras get there? If I wanted to write my own activation layer next to ELU, where exactly would I do that?

  • scikit-learn

    scikit-learn: machine learning in Python

    Project mention: Transformers as Support Vector Machines | | 2023-09-03

    It looks like you've been the victim of some misinformation. As Dr_Birdbrain said, an SVM is a convex problem with a unique global optimum. sklearn.svm.SVC relies on libsvm, which initializes the weights to 0 [0]. The random state is only used to shuffle the data to make probability estimates with Platt scaling [1]. Of the random_state parameter, the sklearn documentation for SVC [2] says:

    Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
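
    The point about random_state can be checked with a small sketch. It assumes scikit-learn and NumPy are installed; the synthetic dataset, the fit_decision_values helper, and the two seed values are illustrative choices, not part of the quoted discussion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)


def fit_decision_values(seed):
    """Fit an SVC with the given random_state and return its decision values.

    `probability` is left at its default (False), so per the documentation
    quoted above, random_state should be ignored during fitting.
    """
    clf = SVC(kernel="rbf", random_state=seed)
    clf.fit(X, y)
    return clf.decision_function(X)


scores_a = fit_decision_values(1)
scores_b = fit_decision_values(12345)

# If random_state has no effect on the fit, both runs should agree.
same_fit = np.allclose(scores_a, scores_b)
print("Identical decision values across seeds:", same_fit)
```

    If probability estimates are enabled instead, the seed affects only the data shuffling used for those estimates, as the quoted documentation describes.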




  • superset

    Apache Superset is a Data Visualization and Data Exploration Platform

    Project mention: Apache Superset Is a Data Visualization and Data Exploration Platform | | 2023-09-11

  • ML-For-Beginners

    12 weeks, 26 lessons, 52 quizzes, classic Machine Learning for all

    Project mention: is it worth learning NLP without master degree? | /r/MLQuestions | 2023-05-01

    I don't recommend jumping directly into natural language processing without understanding artificial intelligence theory. I personally recommend that you start with the basics (regression, classification, and clustering, for example), and then move on to more advanced topics. You already know software development, so that's a big step already, and it should make some concepts easier to understand. Maybe follow Microsoft's machine learning for beginners curriculum? It looks like a good overall roadmap that avoids instantly burning out on NLP.

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: Interacting with Amazon S3 using AWS Data Wrangler (awswrangler) SDK for Pandas: A Comprehensive Guide | | 2023-08-20

    AWS Data Wrangler is a Python library that simplifies the process of interacting with various AWS services, built on top of some useful data tools and open-source projects such as Pandas, Apache Arrow and Boto3. It offers streamlined functions to connect to, retrieve, transform, and load data from AWS services, with a strong focus on Amazon S3.

  • Made-With-ML

    Learn how to design, develop, deploy and iterate on production-grade ML applications.

    Project mention: [D] How do you keep up to date on Machine Learning? | /r/learnmachinelearning | 2023-08-13

    Made With ML

  • Ray

    Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

    Project mention: Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Custom Models | | 2023-08-11

    Training times for GSM8k are mentioned here:

  • streamlit

    Streamlit — A faster way to build and share data apps.

    Project mention: Stop LLM/GenAI hallucination fast: Serverless Kendra RAG with GO | | 2023-09-20

    In evaluating whether a technical solution is suitable, the focus is on the simplicity of development. So, the AWS sample uses the well-known langchain library and a streamlit server for the chat sample.

  • spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Retrieval Augmented Generation (RAG): How To Get AI Models Learn Your Data & Give You Answers | | 2023-09-18

  • AI-Expert-Roadmap

    Roadmap to becoming an Artificial Intelligence Expert in 2022

    Project mention: Suggest which roadmap should I follow for ML? | /r/learnmachinelearning | 2023-07-20


  • Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

    aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

    Project mention: [Q] Bayesian statistics! | /r/statistics | 2023-06-11

    Also, this is quite a nice practical introduction that might help with finding answers to your questions:

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • applied-ml

    📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.

    Project mention: [D] Favorite ML Youtube Channels/Blogs/Newsletters | /r/MachineLearning | 2023-04-08

    Also, have any of you stumbled across any cool GitHub repos like this one: ?

  • lightning

    Deep learning framework to train, deploy, and ship AI products Lightning fast.

    Project mention: Best practice for saving logits/activation values of model in PyTorch Lightning | /r/deeplearning | 2023-07-19

    I've been wondering what the recommended method is for saving logits/activations using PyTorch Lightning. I've looked at Callbacks, Loggers and ModelHooks, but none of their use cases seem to cover this kind of activity (even if I were to create my own custom variants of each utility). The way the ModelCheckpoint Callback works makes me feel that custom Callbacks would be the way to go, but I'm not quite sure. This closed GitHub issue does address my problem to some extent.

  • Data-Science-For-Beginners

    10 Weeks, 20 Lessons, Data Science for All!

    Project mention: Data Science for Beginners - A Curriculum | /r/programming | 2023-09-08

  • ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

    Project mention: Tutorials on creating primitive ML algorithms from scratch? | /r/learnmachinelearning | 2023-01-24


  • gradio

    Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

    Project mention: Gradio sharable link expires too soon ( 30 mins to 1 hour, instead of lasting 72 hours ) | /r/StableDiffusion | 2023-06-10

    I found an issue on the Gradio GitHub, but it looks like it's closed, so I am not sure if it's still a common issue or if I am the only one facing it due to certain settings or the absence of a fix.

  • awesome-datascience

    :memo: An awesome Data Science repository to learn and apply for real world problems.

    Project mention: Mastering Data Science: Top 10 GitHub Repos You Need to Know | | 2023-04-24

    9. Awesome Data Science: If you’re on the hunt for data science resources, Awesome Data Science is a goldmine. This curated list includes MOOCs, books, courses, blogs, podcasts, software, and more, all related to data science.

  • dash

    Data Apps & Dashboards for Python. No JavaScript Required.

    Project mention: [Python] NiceGUI: Let any browser be the frontend for your Python code | /r/aufdeutsch | 2023-04-25

    Of course there are valid use cases for splitting frontend and backend technologies. NiceGUI is for those who don’t want to leave the Python ecosystem and like to reap the benefits of having all code in one place. There are other options like Streamlit, Dash, Anvil, JustPy, and Pynecone. But we initially created NiceGUI to easily handle the state of external hardware like LEDs, motors, and cameras. Additionally, we wanted to offer a gentle learning curve while still providing the ability to go all the way down to HTML, CSS, and JavaScript if needed.

  • fastbook

    The fastai book, published as Jupyter Notebooks

    Project mention: Fastai Chapter 4 - The important parts, Part 2: Building a regression model | | 2023-01-25

    The book is available online here. The course is accessible here.

  • d2l-en

    Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.

    Project mention: Which book to choose for deep learning: Ian Goodfellow or Francois Chollet | /r/learnmachinelearning | 2023-04-07

  • matplotlib

    matplotlib: plotting with Python

    Project mention: Tkinter, PyGame windows too large on Mac | /r/learnpython | 2023-06-29

    as suggested here.

  • recommenders

    Best Practices on Recommendation Systems

    Project mention: My kernel dies when I fit my LightFm model from Microsoft Recommenders | /r/Jupyter | 2023-06-16

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-09-20.


What are some of the best open-source Data Science projects? This list will help you:

 #  Project                                                     Stars
 1  Keras                                                       59,372
 2  scikit-learn                                                55,910
 3  superset                                                    54,269
 4  ML-For-Beginners                                            53,590
 5  Pandas                                                      39,797
 6  Made-With-ML                                                34,174
 7  Ray                                                         27,697
 8  streamlit                                                   27,263
 9  spaCy                                                       27,161
10  AI-Expert-Roadmap                                           26,781
11  Probabilistic-Programming-and-Bayesian-Methods-for-Hackers  25,865
12  data-science-ipython-notebooks                              25,585
13  applied-ml                                                  24,749
14  lightning                                                   24,653
15  Data-Science-For-Beginners                                  22,550
16  ML-From-Scratch                                             22,390
17  gradio                                                      21,988
18  awesome-datascience                                         21,847
19  dash                                                        19,382
20  fastbook                                                    19,211
21  d2l-en                                                      19,183
22  matplotlib                                                  18,108
23  recommenders                                                16,359