Python Data Science

Open-source Python projects categorized as Data Science

Top 23 Python Data Science Projects

  • Keras

    Deep Learning for humans

    Project mention: Keras 3.0 | news.ycombinator.com | 2023-11-28

    All breaking changes are listed here: https://github.com/keras-team/keras/issues/18467

    You can use this migration guide to identify and fix each of these issues (and, further, to make your code run on JAX or PyTorch): https://keras.io/guides/migrating_to_keras_3/

  • scikit-learn

    scikit-learn: machine learning in Python

    Project mention: Polars | news.ycombinator.com | 2024-01-08

    sklearn is adding support through the dataframe interchange protocol (https://github.com/scikit-learn/scikit-learn/issues/25896). scipy, as far as I know, doesn't explicitly support dataframes (it just happens to work when you wrap a Series in `np.array` or `np.asarray`). I don't know about PyTorch but in general you can convert to numpy.

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: Deploying a Serverless Dash App with AWS SAM and Lambda | dev.to | 2024-03-04

    Dash is a Python framework that enables you to build interactive frontend applications without writing a single line of JavaScript. Internally and in projects, we like to use it to build quick proofs of concept for data-driven applications because of its nice integration with Plotly and pandas. For this post, I'm going to assume that you're already familiar with Dash and won't explain that part in detail. Instead, we'll focus on what's necessary to make it run serverless.

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12

    Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.

  • streamlit

    Streamlit — A faster way to build and share data apps.

    Project mention: Show HN: Buefy Web Components for Streamlit | news.ycombinator.com | 2024-03-04

    While building dashboards in Streamlit, I found myself really missing Buefy's (Bulma) modern web components.

    This was especially due to the inability to add new values to Streamlit's multiselect [1] and to some missing controls, such as a polished image carousel [2] or a highly customizable data table.

    Long story short, we put together streamfy (Streamlit + Buefy) as an MIT licensed project in GitHub to bring Buefy to Streamlit.

    Demo: https://streamfy.streamlit.app

    All the form components are implemented, but half of the other non-form UX components are still missing. There is plenty of room for PRs, testing, feedback, documentation, examples, etc.

    Please send issues and contributions to the GitHub project [3] and general feedback to X / Twitter [4].

    Thanks!

    [1] https://github.com/streamlit/streamlit/issues/5348

  • Ray

    Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

    Project mention: Open Source Advent Fun Wraps Up! | dev.to | 2024-01-05

    22. Ray | Github | tutorial

  • spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Best AI SEO Tools for NLP Content Optimization | /r/aitoolsnews | 2023-12-09

    SpaCy: An open-source library providing tools for advanced NLP tasks like tokenization, entity recognition, and part-of-speech tagging.

  • gradio

    Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

    Project mention: Show HN: Dropbase – Build internal web apps with just Python | news.ycombinator.com | 2023-12-05

    There's also that library all the AI models started using that gives you a public URL to share. After researching it: https://www.gradio.app/ is the link.

    It's used specifically for making simple UIs for machine learning apps. But I guess technically you could use it for anything.

  • pytorch-lightning

    Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

    Project mention: Lightning AI Studios – A persistent GPU cloud environment | news.ycombinator.com | 2023-12-14

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

  • d2l-en

    Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.

    Project mention: Which book to choose for deep learning: Ian Goodfellow or Francois Chollet | /r/learnmachinelearning | 2023-04-07

  • dash

    Data Apps & Dashboards for Python. No JavaScript Required.

    Project mention: dash VS solara - a user suggested alternative | libhunt.com/r/dash | 2023-10-13

  • matplotlib

    matplotlib: plotting with Python

    Project mention: How and where is matplotlib package making use of PySide? | /r/learnpython | 2023-12-07

  • recommenders

    Best Practices on Recommendation Systems

    Project mention: My kernel dies when I fit my LightFM model from Microsoft Recommenders | /r/Jupyter | 2023-06-16

  • ipython

    Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.

    Project mention: The new pdbp (Pdb+) Python debugger! | dev.to | 2023-08-02

    If you’re already using ipython, this isn’t a problem because you’ll already need to download most of these dependencies anyway. But if you’re not using ipython… you’ll still need to download those dependencies.

  • best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

  • gensim

    Topic Modelling for Humans

    Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13

  • nni

    An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

    Project mention: Filter Pruning for PyTorch | /r/deeplearning | 2023-04-13

  • dvc

    🦉 ML Experiments Management with Git

    Project mention: Why bad scientific code beats code following "best practices" | news.ycombinator.com | 2024-01-06

    What you’re describing sounds like DVC (at a higher-ish—80%-solution level).

    https://dvc.org/

    See pachyderm too.

  • ydata-profiling

    1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

    Project mention: FLaNK 25 December 2023 | dev.to | 2023-12-26

  • seaborn

    Statistical data visualization in Python

    Project mention: Apache Superset | news.ycombinator.com | 2024-02-26

    If you are doing data analysis I don't think any of the 3 pieces of software you mentioned are going to be that helpful.

    I see these products as tools for data visualization and reporting i.e. presenting prepared datasets to users in a visually appealing way. They aren't as well suited for serious analytics.

    I can't comment on Superset or Tableau, but I am familiar with Power BI (it has been rolled out across my org). The types of statistics you can do with it are fairly rudimentary. If you need to do anything beyond summarizing (counts, averages, min, max, etc.), it is not particularly easy.

    For data analysis I use SAS or R. This software allows you to do things like multivariate regression, time series forecasting, PCA, cluster analysis, etc. There is also plotting capability.

    Both these products are kind of old school; I've been using them since the early 2000s. The "new school" seems to be Python. Pretty much all the recent data science people in my organization use Python, particularly Pandas and libraries like Seaborn (https://seaborn.pydata.org/).

    The "power" users of Power BI in my organization tend to be finance/HR people, with use cases like drilling down into cost figures or interactively presenting KPIs and other headline figures to management.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-13.

Index

What are some of the best open-source Data Science projects in Python? This list will help you:

Rank Project Stars
1 Keras 60,643
2 scikit-learn 57,674
3 Pandas 41,573
4 Airflow 33,864
5 streamlit 30,808
6 Ray 30,364
7 spaCy 28,455
8 gradio 27,486
9 pytorch-lightning 26,457
10 data-science-ipython-notebooks 26,278
11 ML-From-Scratch 23,004
12 d2l-en 21,232
13 dash 20,291
14 matplotlib 19,003
15 recommenders 17,709
16 ipython 16,111
17 best-of-ml-python 15,178
18 gensim 15,074
19 Prefect 14,278
20 nni 13,646
21 dvc 12,976
22 ydata-profiling 11,904
23 seaborn 11,808