Python Data Analysis

Open-source Python projects categorized as Data Analysis

Top 23 Python Data Analysis Projects

  • scikit-learn

    scikit-learn: machine learning in Python

    Project mention: Best Websites For Coders | 2023-01-25

    Scikit-learn: A Python module for machine learning built on top of SciPy

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: How to query pandas DataFrames with SQL | 2023-02-01

    Pandas is a go-to tool for tabular data management, processing, and analysis in Python, but sometimes you may want to go from pandas to SQL.
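
    One common way to go from pandas to SQL is to copy a DataFrame into an in-memory SQLite database and query it there. The following is a minimal sketch, not necessarily the approach the mentioned post takes; the `orders` table and its columns are hypothetical sample data invented for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data, invented for illustration only.
orders = pd.DataFrame(
    {
        "customer": ["ann", "bob", "ann", "cara"],
        "amount": [20.0, 35.5, 14.5, 50.0],
    }
)

# Copy the DataFrame into an in-memory SQLite table named "orders".
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)

# Run ordinary SQL against the table and read the result back as a DataFrame.
totals = pd.read_sql_query(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer",
    conn,
)
conn.close()

print(totals)
```

    This keeps everything in the standard library plus pandas; for larger data or a different SQL dialect, a dedicated engine may be a better fit.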

  • streamlit

    Streamlit — The fastest way to build data apps in Python

    Project mention: What are you guys using for making GUIs nowadays? | 2023-01-26

    - For a PoC / localhost / web usage :

  • best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

    Project mention: Ask HN: How to get back into AI? | 2022-12-10

    For Python, here's a nice compilation:

  • pandas-profiling

    Create HTML profiling reports from pandas DataFrame objects

    Project mention: pandas-profiling VS Rath - a user suggested alternative | 2023-01-12

    Open Machine Learning Course

    Project mention: NEW Courses - star count:8584.0 | 2023-02-01

  • statsmodels

    Statsmodels: statistical modeling and econometrics in Python

    Project mention: [P] statsmodels.tsa.holtwinters.ExponentialSmoothing results in NaN forecasts and parameters when fitting on entire dataset using known parameters from training model. | 2022-11-19

    I reckon you're more likely to get a good response on their GitHub page than here, unless a dev happens to see this post.

  • pyod

    A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

    Project mention: Pyod – A Comprehensive and Scalable Python Library for Outlier Detection | 2022-08-10

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

  • knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

    Project mention: How do you document a ML research? | 2022-08-30

    While it's a start, I feel that just being a markdown editor is not enough; GitHub and GitLab already have this sort of wiki. I feel something like provides a better experience, since it gives data scientists an incentive to keep their source notebook well documented and to treat it as an SSoT. With a wiki, if you change something in the original project, you need to remind yourself to update your reports. If your notebook is itself your report, that's not necessary. Plus, it would benefit from the semantic diffs that DagsHub has already implemented.

  • missingno

    Missing data visualization module for Python.

    Project mention: #VisualizationTip: Using Seaborn(Heatmap) to visualize Missing data( Yellow- Representation of a column's missing data.) | 2022-10-04

    Good job, but I would recommend missingno; it's a powerful module for visualizing missing values.

  • plotnine

    A grammar of graphics for Python

    Project mention: Is R or Python an EASIER option for non-CS/SE grads? | 2022-12-12

    You could use plotnine if you like the grammar of graphics concept:

  • AWS Data Wrangler

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

    Project mention: I agree that Arrow Tables are great, but we decided to keep the library focused on the Pandas interface. [wont implement] | 2022-09-21

  • flyte

    Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source.

    Project mention: Github alternative for ML? | 2023-01-26

    Have you looked at It aims to bring "versioning", "compute" and "reproducibility" together in one package.

  • igel

    A delightful machine learning tool that allows you to train, test, and use models without writing code

  • pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

    Project mention: pandas datareader get_data_yahoo(), Fix question, I have heard they changed something to make it secure | 2023-01-21

  • sweetviz

    Visualize and compare datasets, target values and associations, with one line of code.

  • Cubes

    Light-weight Python OLAP framework for multi-dimensional data analysis

  • nannyml

    Detecting silent model failure. NannyML estimates performance for regression and classification models using tabular data. It alerts you when and why it changed. It is the only open-source library capable of fully capturing the impact of data drift on performance.

    Project mention: [HIRING][Full Time, Part Time, Temporary, Internship, Freelance] Data Science Intern (Remote) | 2022-05-20

    Description NannyML - creators of an Open Source Python library, are looking for multiple Data Science interns to help across research, prototyping, and product. Github: About Us NannyML is an Open Source Python lib …

  • pycm

    Multi-class confusion matrix library in Python

    Project mention: PyCM 3.8 Released: Distance/Similarity Support | 2023-02-01

  • Optimus

    Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

    Project mention: Release 0.8.3 · capitalone/DataProfiler | 2022-11-14

  • siuba

    Python library for using dplyr like syntax with pandas and SQL

    Project mention: Best alternative to Pandas 2023? | 2023-01-13

    I don't know what's best for you, but I can recommend Siuba, a tidy interface for Python to send queries to pandas and SQL-db.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-02-01.

What are some of the best open-source Data Analysis projects in Python? This list will help you:

Rank  Project             Stars
1     scikit-learn        52,699
2     Pandas              36,692
3     streamlit           22,333
4     best-of-ml-python   12,524
5     pandas-profiling    10,067
6                         8,586
7     statsmodels         8,135
8     pyod                6,677
9     akshare             5,861
10    knowledge-repo      5,260
11    missingno           3,440
12    plotnine            3,336
13    AWS Data Wrangler   3,297
14    flyte               3,039
15    igel                3,023
16    pandas-datareader   2,555
17    sweetviz            2,305
18    Cubes               1,481
19    nannyml             1,362
20    pycm                1,347
21    Optimus             1,337
22    DataProfiler        1,084
23    siuba               985