Top 23 Python Data Analysis Projects

scikit-learn

81 57,985 9.9 Python

scikit-learn: machine learning in Python

Project mention: AutoCodeRover resolves 22% of real-world GitHub in SWE-bench lite | news.ycombinator.com | 2024-04-09

Thank you for your interest. There are some interesting examples in the SWE-bench-lite benchmark which are resolved by AutoCodeRover:
- From sympy: https://github.com/sympy/sympy/issues/13643. AutoCodeRover's patch for it: https://github.com/nus-apr/auto-code-rover/blob/main/results...
- Another one from scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/13070. AutoCodeRover's patch (https://github.com/nus-apr/auto-code-rover/blob/main/results...) modified a few lines below (compared to the developer patch) and wrote a different comment.
There are more examples in the results directory (https://github.com/nus-apr/auto-code-rover/tree/main/results).

Pandas

393 41,863 10.0 Python

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Project mention: Deploying a Serverless Dash App with AWS SAM and Lambda | dev.to | 2024-03-04

Dash is a Python framework that enables you to build interactive frontend applications without writing a single line of JavaScript. Internally and in projects we like to use it in order to build a quick proof of concept for data-driven applications because of the nice integration with Plotly and pandas. For this post, I'm going to assume that you're already familiar with Dash and won't explain that part in detail. Instead, we'll focus on what's necessary to make it run serverless.

streamlit

253 31,361 9.8 Python

Streamlit — A faster way to build and share data apps.

Project mention: 🦙 Llama-2-GGML-CSV-Chatbot 🤖 | dev.to | 2024-04-10

Developed using Langchain and Streamlit technologies for enhanced performance.

gradio

115 28,556 9.9 Python

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

Project mention: Show HN: Dropbase – Build internal web apps with just Python | news.ycombinator.com | 2023-12-05

There's also that library all the AI models started using that gives you a public URL to share. After researching it: https://www.gradio.app/ is the link.
It's used specifically for making simple UIs for machine learning apps. But I guess technically you could use it for anything.

best-of-ml-python

16 15,284 7.9 Python

🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

airbyte

139 13,821 10.0 Python

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.
It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.

ydata-profiling

43 11,992 8.5 Python

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Project mention: FLaNK 25 December 2023 | dev.to | 2023-12-26

pygwalker

22 9,660 9.6 Python

PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis

Project mention: Show HN: Use an "eraser" to clean data on flight without breaking your workflow | news.ycombinator.com | 2024-03-15

statsmodels

8 9,513 9.4 Python

Statsmodels: statistical modeling and econometrics in Python

Project mention: statsmodels Release Candidate 0.14.0rc0 tagged | /r/Python | 2023-04-26

mlcourse.ai

85 9,382 3.4 Python

Open Machine Learning Course

Project mention: Open Machine Learning Course | news.ycombinator.com | 2023-10-22

cleanlab

69 8,592 9.4 Python

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05

We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.

akshare

0 8,321 9.7 Python

AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

pyod

7 7,928 7.7 Python

A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13

This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod

imbalanced-learn

1 6,687 7.4 Python

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

Project mention: What’s your approach to highly imbalanced data sets? | /r/datascience | 2023-05-26

There's a plethora of undersampling and oversampling models you can try out. To avoid removing information from the dataset, you can focus on oversampling techniques. You can try imbalanced-learn or smote-variants. Given enough data, using fully synthetic data is also an option; you can check ydata-synthetic for it. Let us know how it turned out!
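To make the oversampling idea mentioned above concrete, here is a minimal standard-library sketch of random oversampling: minority-class rows are duplicated at random until every class is as large as the biggest one, so no original row is removed. The function name `random_oversample` and the toy data are hypothetical, chosen for illustration only. This is not the API of imbalanced-learn or smote-variants, which offer more sophisticated samplers.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are the same size.

    Hypothetical helper for illustration; not part of any library named above.
    """
    rng = random.Random(seed)          # seeded so the result is reproducible
    counts = Counter(labels)
    target = max(counts.values())      # size of the largest class

    out_rows, out_labels = list(rows), list(labels)  # keep every original row
    for cls, n in counts.items():
        # All rows that belong to this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw (with replacement) as many extra rows as the class is short
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: four rows of class 0 and a single row of class 1
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Plain duplication like this adds no new information and can encourage overfitting to the repeated rows, which is one reason synthetic approaches such as SMOTE are often tried as well.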

knowledge-repo

2 5,429 4.1 Python

A next-generation curated knowledge sharing platform for data scientists and other technical professions.

Resume-Matcher

8 4,473 8.7 Python

Resume Matcher is an open source, free tool to improve your resume. It works by using language models to compare and rank resumes with job descriptions.

Project mention: Hacktoberfest 2023: The Complete Guide | dev.to | 2023-09-22

GitHub: https://github.com/srbhr/Resume-Matcher
Website: https://www.resumematcher.fyi/
Discord: Resume Matcher's Discord
Tech Stack: Python, NextJS, FastAPI, TypeScript

plotnine

36 3,809 9.7 Python

A Grammar of Graphics for Python

Project mention: FLaNK AI Weekly 18 March 2024 | dev.to | 2024-03-18

AWS Data Wrangler

9 3,797 9.4 Python

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, and a few other optimizations that made it a great tool.

missingno

5 3,771 1.9 Python

Missing data visualization module for Python.

running_page

3 3,229 9.0 Python

Make your own running home page

Project mention: Ask HN: Comment here about whatever you're passionate about at the moment | news.ycombinator.com | 2023-11-06

A resource recently shared in HN for running tech lovers https://github.com/yihong0618/running_page

igel

11 3,080 1.1 Python

a delightful machine learning tool that allows you to train, test, and use models without writing code

sweetviz

1 2,828 6.7 Python

Visualize and compare datasets, target values and associations, with one line of code.

pandas-datareader

3 2,812 6.3 Python

Extract data from a wide range of Internet sources into a pandas DataFrame.

Project mention: Seeking recommendations for forex economic data API | /r/algotrading | 2023-05-03

I've looked at https://github.com/pydata/pandas-datareader and it looks good, does anyone have experience?

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-10.

Python Data Analysis related posts

The Design Philosophy of Great Tables (Software Package)
7 projects | news.ycombinator.com | 4 Apr 2024
Show HN: Use an "eraser" to clean data on flight without breaking your workflow
1 project | news.ycombinator.com | 15 Mar 2024
Deploying a Serverless Dash App with AWS SAM and Lambda
3 projects | dev.to | 4 Mar 2024
Help Us Build Our Roadmap – Pydantic
2 projects | news.ycombinator.com | 19 Feb 2024
Show HN: File Hider
5 projects | news.ycombinator.com | 12 Jan 2024
Show HN: Data Painter – different way to interact with data in Jupyter notebook
1 project | news.ycombinator.com | 2 Jan 2024
Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres
1 project | news.ycombinator.com | 12 Dec 2023

Index

What are some of the best open-source Data Analysis projects in Python? This list will help you:

#	Project	Stars
1	scikit-learn	57,985
2	Pandas	41,863
3	streamlit	31,361
4	gradio	28,556
5	best-of-ml-python	15,284
6	airbyte	13,821
7	ydata-profiling	11,992
8	pygwalker	9,660
9	statsmodels	9,513
10	mlcourse.ai	9,382
11	cleanlab	8,592
12	akshare	8,321
13	pyod	7,928
14	imbalanced-learn	6,687
15	knowledge-repo	5,429
16	Resume-Matcher	4,473
17	plotnine	3,809
18	AWS Data Wrangler	3,797
19	missingno	3,771
20	running_page	3,229
21	igel	3,080
22	sweetviz	2,828
23	pandas-datareader	2,812