Python Data Analysis

Open-source Python projects categorized as Data Analysis

Top 23 Python Data Analysis Projects

  • GitHub repo scikit-learn

    scikit-learn: machine learning in Python

    Project mention: scikit-learn test case results? | | 2022-01-05

  • GitHub repo Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: Best Data Structure for this? | | 2022-01-17

    If you really want to store it all (labels included) in one data structure, you should look up pandas.
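As a rough illustration of keeping values and their labels in one structure, here is a minimal pandas sketch. The names and measurements are invented for the example and are not from the original thread.

```python
import pandas as pd

# Hypothetical measurements; row and column labels are invented for illustration.
df = pd.DataFrame(
    {"height_cm": [172.0, 181.5, 165.2], "weight_kg": [68.1, 80.4, 59.9]},
    index=["alice", "bob", "carol"],
)

# Look up a single value by its row label and column label.
bob_height = df.loc["bob", "height_cm"]

# Column-wise statistics keep working on the labeled data.
mean_weight = df["weight_kg"].mean()

print(bob_height)
print(round(mean_weight, 2))
```

The row index and column names travel with the data, so lookups and aggregations can refer to labels rather than positions.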

  • GitHub repo streamlit

    Streamlit — The fastest way to build data apps in Python

    Project mention: How to Build a Machine Learning Demo in 2022 | | 2022-01-16

    So what if you want something almost as flexible as what is possible with the full-stack approach, but without the development requirements? Well, you are in luck because the past few years have seen the emergence of Python libraries that allow the creation of impressively interactive demos with only a few lines of code. In this article, we are going to focus on two of the most promising libraries: Gradio and Streamlit. There are notable differences between the two that will be explored below, but the high level idea is the same: eliminate most of the painful back and front end work outlined in the full-stack section, albeit at the cost of some flexibility.

  • GitHub repo statsmodels

    Statsmodels: statistical modeling and econometrics in Python

    Project mention: Advice required to choose appropriate software for an assignment | | 2021-04-26

    Can't you get a student discount for Stata? R would definitely be able to handle everything. For Python, have a look through the statsmodels package.

  • GitHub repo pyod

    (JMLR' 19) A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)

    Project mention: [D] Unsupervised Outlier Detection - Advise Requested | | 2021-12-03

    The source code and documentation of PyOD are the best survey of OOD. Besides, normalizing flows and VQ-VAE are also feasible.

  • GitHub repo knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

    Project mention: How does everyone share their models etc. across teams for re-use effectively? | | 2021-05-22

  • GitHub repo gradio

    Create UIs for your machine learning model in Python in 3 minutes

    Project mention: I automated my job over a year ago and haven't told anyone. | | 2022-01-12

    Interesting, I had never heard of Tk or Qt. I've been using Streamlit and Gradio as GUIs for my Python scripts, and they have been awesome, but something like Qt seems much more robust and customizable than what I'm using.

  • GitHub repo akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

  • GitHub repo missingno

    Missing data visualization module for Python.

    Project mention: For all the python/pandas users out there I just released a bunch of UI updates to the free visualizer, D-Tale | | 2021-04-12

    Analysis of "missing" data using the missingno package is now available in a sliding side panel, where you can enlarge or download PNG files of the matrix, bar, heatmap, and dendrogram charts generated using missingno.

  • GitHub repo igel

    a delightful machine learning tool that allows you to train, test, and use models without writing code

    Project mention: Train/fit, test, and use models without writing code | | 2021-06-29

    Link to the repo:

  • GitHub repo dtale

    Visualizer for pandas data structures

    Project mention: Show HN: D-Tale, easy to use pandas GUI | | 2021-11-01

  • GitHub repo plotnine

    A grammar of graphics for Python

    Project mention: Should I learn matplotlib in 2022? | | 2022-01-09

    If you are familiar with R or ggplot, I recommend using plotnine. It implements ggplot2 (the well-known graphics package for R) in Python. In fact, plotnine is just a wrapper of matplotlib. However, it is a little more convenient than pure matplotlib.

  • GitHub repo AWS Data Wrangler

    Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

    Project mention: Automate some wrangling and data visualization in Python | | 2022-01-03

  • GitHub repo pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

    Project mention: Best quantitative tools/repos/apis for Sentiment & Social Media analysis of individual Stock/Crypto tickers | | 2021-07-03

    Also Yahoo continually takes steps to discourage programmatic access (the most recent attempt is happening right now:

  • GitHub repo sweetviz

    Visualize and compare datasets, target values and associations, with one line of code.

    Project mention: Automated Data Profiling and Attribute Clustering using unsupervised ML techniques | | 2021-07-03

    Take a look at this package which computes associations between variables and other viz and can infer some types

  • GitHub repo flyte

    Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source.

    Project mention: Hacktoberfest: Flytesnacks Project "update tuple output examples" | | 2021-11-01

    I chose the flytekit project, which is one of the component repos of Flyte and provides the Python SDK and tools for the Flyte project.

  • GitHub repo vectorbt

    Find your trading edge, using the fastest engine for backtesting, algorithmic trading, and research.

    Project mention: Repost with explanation - OOS Testing cluster | | 2022-01-01

    I second the idea of looking through software optimization, but there is no need to jump right to C. I would look at something like vectorbt. You get the speed of C running under the hood while staying in Python for your backtesting code.
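The general idea of staying in Python while compiled code does the heavy lifting can be sketched with plain NumPy. This is not vectorbt's API; it is a hand-written moving-average crossover backtest on invented prices, with hypothetical window sizes, meant only to show array operations replacing Python loops.

```python
import numpy as np

# Hypothetical closing prices, invented for illustration.
prices = np.array([100.0, 101.0, 103.0, 102.0, 105.0,
                   107.0, 106.0, 104.0, 108.0, 110.0])


def moving_average(values, window):
    """Trailing simple moving average; the first window - 1 entries are NaN."""
    out = np.full(values.shape, np.nan)
    cumsum = np.cumsum(np.insert(values, 0, 0.0))
    out[window - 1:] = (cumsum[window:] - cumsum[:-window]) / window
    return out


fast = moving_average(prices, 2)  # assumed fast window
slow = moving_average(prices, 4)  # assumed slow window

# Long (1.0) when the fast average is above the slow one, flat (0.0) otherwise.
# Days without a valid slow average stay flat.
valid = ~np.isnan(slow)
position = np.zeros(prices.shape)
position[valid] = (fast[valid] > slow[valid]).astype(float)

# Daily returns, earned using the previous day's position to avoid look-ahead.
returns = np.diff(prices) / prices[:-1]
strategy_returns = position[:-1] * returns
total_return = np.prod(1.0 + strategy_returns) - 1.0

print(total_return)
```

Every step operates on whole arrays, so the per-element work runs inside NumPy's compiled routines rather than in a Python loop.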

  • GitHub repo Cubes

    Light-weight Python OLAP framework for multi-dimensional data analysis

    Project mention: Building data analysis apps | | 2021-04-16

    I'm looking for materials and tools to learn. I'm reading up on OLAP and cubes. I found cubes python package but it hasn't been updated in years. Could you give me some tips on what to learn in 2021?

  • GitHub repo pycm

    Multi-class confusion matrix library in Python

    Project mention: [P] PyCM 3.3 released: Comparison of Classifiers Based on Confusion Matrix | | 2021-10-27

  • GitHub repo Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • GitHub repo nfstream

    NFStream: a Flexible Network Data Analysis Framework.

    Project mention: Open Source Deep Packet Inspection Using Python | | 2021-07-02

    GitHub project:

    Community feedback and contributions are welcome!

  • GitHub repo siuba

    Python library for using dplyr-like syntax with pandas and SQL

    Project mention: Going from R to Pandas: dplython vs dfply vs plydata | | 2021-09-30

    You should follow /u/the75th's advice. However, if you decide to buck that take, I'd look into siuba. I've never heard of those packages you've listed, and have doubts they'd be maintained.

  • GitHub repo DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

    Project mention: Miller – tool for querying, shaping, reformatting data in CSV, TSV, and JSON | | 2021-12-22

    My team built a similar tool in Python to load any delimited file, JSON, Parquet, and Avro with one command:

    Effectively loads anything into a dataframe

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-17.

What are some of the best open-source Data Analysis projects in Python? This list will help you:

Rank  Project             Stars
1     scikit-learn        48,549
2     Pandas              32,341
3     streamlit           17,351
4     statsmodels          7,006
5     pyod                 5,181
6     knowledge-repo       4,985
7     gradio               4,755
8     akshare              4,446
9     missingno            3,039
10    igel                 2,961
11    dtale                2,914
12    plotnine             2,904
13    AWS Data Wrangler    2,445
14    pandas-datareader    2,203
15    sweetviz             1,883
16    flyte                1,839
17    vectorbt             1,562
18    Cubes                1,434
19    pycm                 1,200
20    Optimus              1,159
21    nfstream               753
22    siuba                  742
23    DataProfiler           735