Python Data Analysis

Open-source Python projects categorized as Data Analysis

Top 23 Python Data Analysis Projects

  • GitHub repo scikit-learn

    scikit-learn: machine learning in Python

    Project mention: Data Science toolset summary from 2021 | | 2021-11-13

    Scikit-learn - It is one of the most widely used frameworks for Python-based data science tasks. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

  • GitHub repo Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: Why df.drop_duplicates() doesn't work for me? | | 2021-12-02

    There is discussion of the inplace argument becoming deprecated across the entire pandas API.

  • GitHub repo streamlit

    Streamlit — The fastest way to build data apps in Python

    Project mention: Suggestions for GUI framework for an app to browse tables of data, with buttons and dropdown menus in cells? And some related PySimpleGui questions | | 2021-11-07

    I've never used it, but someone suggested it in another thread, and it looked interesting to me, so I have it bookmarked to try out:

  • GitHub repo statsmodels

    Statsmodels: statistical modeling and econometrics in Python

    Project mention: Advice required to choose appropriate software for an assignment | | 2021-04-26

    Can't you get a student discount for Stata? R would definitely be able to handle everything. For Python, have a look through the statsmodels package.

  • GitHub repo pyod

    A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)

    Project mention: [D] Unsupervised Outlier Detection - Advise Requested | | 2021-12-03

    The source code and documentation of PyOD are the best survey of OOD detection. Besides, normalizing flows and VQ-VAE are also feasible.

  • GitHub repo knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

    Project mention: How does everyone share their models etc. across teams for re-use effectively? | | 2021-05-22

  • GitHub repo akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! 开源财经数据接口库

  • GitHub repo gradio

    Create UIs for your machine learning model in Python in 3 minutes

    Project mention: AnimeGan V2 | | 2021-11-07

  • GitHub repo missingno

    Missing data visualization module for Python.

    Project mention: For all the python/pandas users out there I just released a bunch of UI updates to the free visualizer, D-Tale | | 2021-04-12

    Analysis of "missing" data using the missingno package is now available in a sliding side panel. You can enlarge or download PNG files for the matrix/bar/heatmap/dendrogram charts generated using missingno.

  • GitHub repo igel

    a delightful machine learning tool that allows you to train, test, and use models without writing code

    Project mention: Train/fit, test, and use models without writing code | | 2021-06-29

    Link to the repo:

  • GitHub repo plotnine

    A grammar of graphics for Python

    Project mention: Book Recommendations Matlab->Python? | | 2021-11-02

    plotnine is a Python port of ggplot.

  • GitHub repo AWS Data Wrangler

    Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

    Project mention: Redshift API vs. other ways to connect? | | 2021-10-21

    awslabs has developed their own package for this and, given it's for their product, seems likely to maintain it.

  • GitHub repo pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

    Project mention: Best quantitative tools/repos/apis for Sentiment & Social Media analysis of individual Stock/Crypto tickers | | 2021-07-03

    Also, Yahoo continually takes steps to discourage programmatic access (the most recent attempt is happening right now).

  • GitHub repo sweetviz

    Visualize and compare datasets, target values and associations, with one line of code.

    Project mention: Automated Data Profiling and Attribute Clustering using unsupervised ML techniques | | 2021-07-03

    Take a look at this package which computes associations between variables and other viz and can infer some types

  • GitHub repo flyte

    Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source.

    Project mention: Hacktoberfest: Flytesnacks Project "update tuple output examples" | | 2021-11-01

    I chose the flytekit project, which is one of the component repos of Flyte and contains the Python SDK and tools of the Flyte project.

  • GitHub repo Cubes

    Light-weight Python OLAP framework for multi-dimensional data analysis

    Project mention: Building data analysis apps | | 2021-04-16

    I'm looking for materials and tools to learn. I'm reading up on OLAP and cubes. I found cubes python package but it hasn't been updated in years. Could you give me some tips on what to learn in 2021?

  • GitHub repo vectorbt

    Next-gen framework for backtesting, algorithmic trading, and research. Blazingly fast. Super accurate. Pandas friendly.

    Project mention: Looking for active python backtesting framework | | 2021-02-09

    However, it's not the fastest framework. If you need speed, and are good with the data science tool chain in python and the concept of flattening loops into vectorized operations, check out vector-bt. I haven't gotten a chance to play with it yet, but I'm definitely going to as soon as I find some spare time. It seems like a great option with a nicely modernized approach.

  • GitHub repo pycm

    Multi-class confusion matrix library in Python

    Project mention: [P] PyCM 3.3 released: Comparison of Classifiers Based on Confusion Matrix | | 2021-10-27

  • GitHub repo Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • GitHub repo nfstream

    NFStream: a Flexible Network Data Analysis Framework.

    Project mention: Open Source Deep Packet Inspection Using Python | | 2021-07-02

    GitHub project:

    Community feedback and contributions are welcome!

  • GitHub repo siuba

    Python library for using dplyr-like syntax with pandas and SQL

    Project mention: Going from R to Pandas: dplython vs dfply vs plydata | | 2021-09-30

    You should follow /u/the75th's advice. However, if you decide to buck that take, I'd look into siuba. I've never heard of those packages you've listed, and have doubts they'd be maintained.

  • GitHub repo DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

    Project mention: [P] DataProfiler - Scaleable Sensitive Data Detection & Analysis on Structured & Unstructured Files | | 2021-11-22

  • GitHub repo redata

    re_data - fix data issues before your users & CEO would discover them 😊

    Project mention: great_expectations VS redata - a user suggested alternative | | 2021-09-24

    It's more convenient when you are already using dbt and don't want to set up a separate workflow for testing data when it can be done with dbt inside the data warehouse. Also, the thing re_data does well is letting you create time-based metrics about your data quality instead of just tests (a lot of the tests can be rewritten as metrics). That allows you to do a couple of things more than GE: you can, for example, easily visualize those metrics or look for anomalies in them. You can also compute tests much more efficiently. Research about computing metrics as a good way of doing data quality was actually done by the team behind deequ. I'm the author, so obviously I'm a bit biased :)
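
The Pandas entry above quotes a thread asking why df.drop_duplicates() did not seem to work, alongside talk of deprecating the inplace argument. A minimal sketch of one common cause, assuming the result of the call was simply discarded (the thread itself is not reproduced here, so this scenario is a guess; the frame and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical example data: the first two rows are exact duplicates.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 1, 2]})

# By default drop_duplicates returns a new DataFrame and leaves df untouched,
# so calling it without using the result appears to "do nothing".
df.drop_duplicates()
rows_after_ignored_result = len(df)

# Reassigning the result keeps the de-duplicated frame without inplace=True.
deduped = df.drop_duplicates().reset_index(drop=True)
rows_after_reassign = len(deduped)

print(rows_after_ignored_result, rows_after_reassign)
```

Reassignment works whether or not inplace is eventually deprecated, which is one reason it is often suggested over inplace=True.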

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-12-03.

What are some of the best open-source Data Analysis projects in Python? This list will help you:

 #  Project             Stars
 1  scikit-learn        48,081
 2  Pandas              31,817
 3  streamlit           16,661
 4  statsmodels          6,874
 5  pyod                 5,040
 6  knowledge-repo       4,937
 7  akshare              4,270
 8  gradio               4,148
 9  missingno            3,004
10  igel                 2,952
11  plotnine             2,861
12  AWS Data Wrangler    2,339
13  pandas-datareader    2,165
14  sweetviz             1,823
15  flyte                1,796
16  Cubes                1,434
17  vectorbt             1,428
18  pycm                 1,186
19  Optimus              1,139
20  nfstream               740
21  siuba                  727
22  DataProfiler           694
23  redata                 622