Top 23 Python Data Science Projects

Keras

77 60,937 9.9 Python

Deep Learning for humans

Project mention: My Favorite DevTools to Build AI/ML Applications! | dev.to | 2024-04-23

As a beginner, I was looking for something simple and flexible for developing deep learning models and that is when I found Keras. Many AI/ML professionals appreciate Keras for its simplicity and efficiency in prototyping and developing deep learning models, making it a preferred choice, especially for beginners and for projects requiring rapid development.

scikit-learn

81 58,046 9.9 Python

scikit-learn: machine learning in Python

Project mention: AutoCodeRover resolves 22% of real-world GitHub in SWE-bench lite | news.ycombinator.com | 2024-04-09

Thank you for your interest. There are some interesting examples in the SWE-bench-lite benchmark which are resolved by AutoCodeRover:
- From sympy: https://github.com/sympy/sympy/issues/13643. AutoCodeRover's patch for it: https://github.com/nus-apr/auto-code-rover/blob/main/results...
- Another one from scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/13070. AutoCodeRover's patch (https://github.com/nus-apr/auto-code-rover/blob/main/results...) modified a few lines below (compared to the developer patch) and wrote a different comment.
There are more examples in the results directory (https://github.com/nus-apr/auto-code-rover/tree/main/results).

Pandas

393 41,923 10.0 Python

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Project mention: Deploying a Serverless Dash App with AWS SAM and Lambda | dev.to | 2024-03-04

Dash is a Python framework that enables you to build interactive frontend applications without writing a single line of JavaScript. Internally and in projects, we like to use it to build a quick proof of concept for data-driven applications because of the nice integration with Plotly and pandas. For this post, I'm going to assume that you're already familiar with Dash and won't explain that part in detail. Instead, we'll focus on what's necessary to make it run serverless.

Airflow

169 34,485 10.0 Python

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12

Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.
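
Airflow's description above mentions programmatically authoring and scheduling workflows. The underlying idea, tasks arranged as a dependency graph and run in an order that respects those dependencies, can be sketched with the Python standard library alone. This is not Airflow's API; the task names and the `run_order` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(deps):
    """Return one execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step; the sketch covers only the ordering.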

streamlit

254 31,506 9.8 Python

Streamlit — A faster way to build and share data apps.

Project mention: Creating a Sales Analysis Application with Streamlit: A Practical Approach to Business Intelligence | dev.to | 2024-04-19

2. Go to https://streamlit.io, log in, and create a new app from your GitHub repository.

Ray

42 30,988 10.0 Python

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Project mention: Open Source Advent Fun Wraps Up! | dev.to | 2024-01-05

22. Ray | Github | tutorial

gradio

115 28,730 9.9 Python

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

Project mention: Show HN: Dropbase – Build internal web apps with just Python | news.ycombinator.com | 2023-12-05

There's also that library all the AI models started using that gives you a public URL to share. After researching it: https://www.gradio.app/ is the link.
It's used specifically for making simple UIs for machine learning apps. But I guess technically you could use it for anything.

spaCy

106 28,704 9.2 Python

💫 Industrial-strength Natural Language Processing (NLP) in Python

Project mention: Step by step guide to create customized chatbot by using spaCy (Python NLP library) | dev.to | 2024-03-23

Hi Community, in this article I will demonstrate the steps below to create your own chatbot by using spaCy (spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython):

pytorch-lightning

8 26,883 9.9 Python

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Project mention: Lightning AI Studios – A persistent GPU cloud environment | news.ycombinator.com | 2023-12-14

data-science-ipython-notebooks

1 26,459 0.0 Python

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

ML-From-Scratch

3 23,164 0.0 Python

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

d2l-en

6 21,628 8.7 Python

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.

dash

56 20,472 9.6 Python

Data Apps & Dashboards for Python. No JavaScript Required.

Project mention: dash VS solara - a user suggested alternative | libhunt.com/r/dash | 2023-10-13

matplotlib

36 19,223 10.0 Python

matplotlib: plotting with Python

Project mention: How and where is matplotlib package making use of PySide? | /r/learnpython | 2023-12-07

recommenders

6 17,942 9.4 Python

Best Practices on Recommendation Systems

Project mention: My kernel dies when I fit my LightFm model from Microsoft Recommenders | /r/Jupyter | 2023-06-16

ipython

34 16,134 9.6 Python

Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.

Project mention: The new pdbp (Pdb+) Python debugger! | dev.to | 2023-08-02

If you’re already using ipython, this isn’t a problem because you’ll already need to download most of these dependencies anyway. But if you’re not using ipython… you’ll still need to download those dependencies.

best-of-ml-python

16 15,302 7.9 Python

🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

gensim

18 15,236 7.5 Python

Topic Modelling for Humans

Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08

Prefect

19 14,586 10.0 Python

The easiest way to build, run, and monitor data pipelines at scale.

Project mention: Prefect: A workflow orchestration tool for data pipelines | news.ycombinator.com | 2024-03-13

nni

5 13,726 6.7 Python

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

dvc

109 13,116 9.7 Python

🦉 ML Experiments and Data Management with Git

Project mention: My Favorite DevTools to Build AI/ML Applications! | dev.to | 2024-04-23

Collaboration and version control are crucial in AI/ML development projects due to the iterative nature of model development and the need for reproducibility. GitHub is the leading platform for source code management, allowing teams to collaborate on code, track issues, and manage project milestones. DVC (Data Version Control) complements Git by handling large data files, data sets, and machine learning models that Git can't manage effectively, enabling version control for the data and model files used in AI projects.
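
The mention above says DVC handles large files that Git can't manage effectively. One common way to do that in general is to commit a small pointer file (a content hash plus metadata) in place of the data itself. The sketch below illustrates only that general idea; the pointer layout, the `make_pointer` helper, the choice of SHA-256, and the file name are assumptions made for illustration and are not DVC's actual file format or API.

```python
import hashlib
import json
import os
import tempfile


def make_pointer(path):
    """Hash a file in chunks and return a small metadata record describing it."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "path": os.path.basename(path),
        "sha256": digest.hexdigest(),
        "size": os.path.getsize(path),
    }


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical dataset standing in for a large file.
    data_path = os.path.join(tmp, "train.csv")
    with open(data_path, "wb") as f:
        f.write(b"x,y\n1,2\n")
    pointer = make_pointer(data_path)

# The small JSON record is what would be committed to Git instead of the data.
print(json.dumps(pointer, indent=2))
```

Because the record changes whenever the file's contents change, committing it gives the dataset a version history without storing the data in the repository.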

ydata-profiling

43 12,022 8.5 Python

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Project mention: FLaNK 25 December 2023 | dev.to | 2023-12-26

seaborn

76 11,946 8.5 Python

Statistical data visualization in Python

Project mention: Apache Superset | news.ycombinator.com | 2024-02-26

If you are doing data analysis I don't think any of the 3 pieces of software you mentioned are going to be that helpful.
I see these products as tools for data visualization and reporting, i.e., presenting prepared datasets to users in a visually appealing way. They aren't as well suited for serious analytics.
I can't comment on Superset or Tableau, but I am familiar with Power BI (it has been rolled out across my org). The types of statistics you can do with it are fairly rudimentary; anything beyond summarizing (counts, averages, min, max, etc.) is not particularly easy.
For data analysis I use SAS or R. This software allows you to do things like multivariate regression, time-series forecasting, PCA, cluster analysis, etc. There is also plotting capability.
Both these products are kind of old school; I've been using them since the early 2000s. The "new school" seems to be Python. Pretty much all the recent data science people in my organization use Python, particularly Pandas and libraries like Seaborn (https://seaborn.pydata.org/).
The "power" users of Power BI in my organization tend to be finance/HR people, for use cases like drilling down into cost figures or interactively presenting KPIs and other headline figures to management.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
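
The ordering rule in the note can be reproduced in a few lines. This is a minimal sketch using five star counts copied from this page; the `projects` list and its deliberately shuffled order are for illustration only.

```python
# (project, stars) pairs taken from this list, deliberately out of order.
projects = [
    ("gradio", 28730),
    ("Keras", 60937),
    ("spaCy", 28704),
    ("scikit-learn", 58046),
    ("Pandas", 41923),
]

# Rank by star count, highest first.
ranked = sorted(projects, key=lambda item: item[1], reverse=True)

for rank, (name, stars) in enumerate(ranked, start=1):
    print(f"{rank}\t{name}\t{stars:,}")
```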

Python Data Science related posts

My Favorite DevTools to Build AI/ML Applications!
9 projects | dev.to | 23 Apr 2024
Release: Keras 3.3.0
1 project | news.ycombinator.com | 22 Apr 2024
Runhouse
1 project | news.ycombinator.com | 22 Apr 2024
Hierarchical Clustering
1 project | news.ycombinator.com | 20 Apr 2024
Creating a Sales Analysis Application with Streamlit: A Practical Approach to Business Intelligence
1 project | dev.to | 19 Apr 2024
Orange Data Mining
1 project | news.ycombinator.com | 15 Apr 2024
🦙 Llama-2-GGML-CSV-Chatbot 🤖
3 projects | dev.to | 10 Apr 2024

Index

What are some of the best open-source Data Science projects in Python? This list will help you:

#	Project	Stars
1	Keras	60,937
2	scikit-learn	58,046
3	Pandas	41,923
4	Airflow	34,485
5	streamlit	31,506
6	Ray	30,988
7	gradio	28,730
8	spaCy	28,704
9	pytorch-lightning	26,883
10	data-science-ipython-notebooks	26,459
11	ML-From-Scratch	23,164
12	d2l-en	21,628
13	dash	20,472
14	matplotlib	19,223
15	recommenders	17,942
16	ipython	16,134
17	best-of-ml-python	15,302
18	gensim	15,236
19	Prefect	14,586
20	nni	13,726
21	dvc	13,116
22	ydata-profiling	12,022
23	seaborn	11,946