Python Data Analysis

Open-source Python projects categorized as Data Analysis

Top 23 Python Data Analysis Projects

  • scikit-learn

    scikit-learn: machine learning in Python

    Project mention: Polars | news.ycombinator.com | 2024-01-08

    sklearn is adding support through the dataframe interchange protocol (https://github.com/scikit-learn/scikit-learn/issues/25896). scipy, as far as I know, doesn't explicitly support dataframes (it just happens to work when you wrap a Series in `np.array` or `np.asarray`). I don't know about PyTorch but in general you can convert to numpy.
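
    The conversion described in the comment above can be sketched as follows. This is a minimal illustration, assuming pandas and NumPy are installed; the sample values and variable names are hypothetical and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration.
series = pd.Series([1.0, 2.0, 3.0, 4.0], name="x")

# np.asarray returns a plain ndarray; the index and name are dropped.
values = np.asarray(series)

# Series.to_numpy() is the pandas-side equivalent of the call above.
same_values = series.to_numpy()

print(type(values).__name__, values.dtype, values.shape)
```

    Either form yields an array that can be passed to libraries that expect NumPy input rather than a dataframe.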

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: Help Us Build Our Roadmap – Pydantic | news.ycombinator.com | 2024-02-19

    There is a pull request to integrate it into both pydantic extra types and pandas core [1]

    [1]: https://github.com/pandas-dev/pandas/issues/53999

  • streamlit

    Streamlit — A faster way to build and share data apps.

    Project mention: Simplify Web App Development: Code Lite, Create Big! | dev.to | 2024-02-26

    Here's your savior, let's welcome Streamlit.

  • gradio

    Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

    Project mention: Show HN: Dropbase – Build internal web apps with just Python | news.ycombinator.com | 2023-12-05

    There's also that library all the AI models started using that gives you a public URL to share. After researching it: https://www.gradio.app/ is the link.

    It's used specifically for making simple UIs for machine learning apps. But I guess technically you could use it for anything.

  • best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

    I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.

    It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.

  • ydata-profiling

    1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

    Project mention: FLaNK 25 December 2023 | dev.to | 2023-12-26
  • statsmodels

    Statsmodels: statistical modeling and econometrics in Python

    Project mention: statsmodels Release Candidate 0.14.0rc0 tagged | /r/Python | 2023-04-26
  • mlcourse.ai

    Open Machine Learning Course

    Project mention: Open Machine Learning Course | news.ycombinator.com | 2023-10-22
  • pygwalker

    PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis

    Project mention: Show HN: Data Painter – different way to interact with data in Jupyter notebook | news.ycombinator.com | 2024-01-02
  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! Open-source financial data interface library (by akfamily)

  • cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05

    We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.

  • pyod

    A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

    Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13

    This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod

  • imbalanced-learn

    A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

    Project mention: What’s your approach to highly imbalanced data sets? | /r/datascience | 2023-05-26

    There's a plethora of undersampling and oversampling models you can try out. To avoid removing information from the dataset, you can focus on oversampling techniques. You can try imbalanced-learn or smote-variants. Given enough data, using fully synthetic data is also an option; you can check ydata-synthetic for it. Let us know how it turned out!
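
    To make the oversampling idea in the comment above concrete, here is a minimal standard-library sketch of random oversampling, which duplicates minority-class rows until every class is as large as the biggest one. The function name `random_oversample` and the toy data are hypothetical; this is not the imbalanced-learn or smote-variants API, and SMOTE-style methods synthesize new points rather than duplicating existing ones.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)

    # Group rows by class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement; original rows are always kept.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy dataset: five majority rows, one minority row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
labels = [0, 0, 0, 0, 0, 1]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

    No original row is discarded, which is the property the comment highlights. Oversampling is normally applied to the training split only, so that duplicated rows do not leak into evaluation data.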

  • knowledge-repo

    A next-generation curated knowledge sharing platform for data scientists and other technical professions.

  • Resume-Matcher

    Resume Matcher is an open source, free tool to improve your resume. It works by using language models to compare and rank resumes with job descriptions.

    Project mention: Hacktoberfest 2023: The Complete Guide | dev.to | 2023-09-22

    GitHub: https://github.com/srbhr/Resume-Matcher
    Website: https://www.resumematcher.fyi/
    Discord: Resume Matcher's Discord
    Tech Stack: Python, NextJS, FastAPI, TypeScript

  • missingno

    Missing data visualization module for Python.

  • plotnine

    A Grammar of Graphics for Python

    Project mention: A look at the Mojo language for bioinformatics | news.ycombinator.com | 2024-02-11

    To your last point, have you tried plotnine? It's meant to be ggplot2 for python.

    https://github.com/has2k1/plotnine

  • AWS Data Wrangler

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

    Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

    I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, along with a few other optimizations that made it a great tool.

  • running_page

    Make your own running home page

    Project mention: Ask HN: Comment here about whatever you're passionate about at the moment | news.ycombinator.com | 2023-11-06

    A resource recently shared on HN for lovers of running tech: https://github.com/yihong0618/running_page

  • igel

    a delightful machine learning tool that allows you to train, test, and use models without writing code

  • pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

    Project mention: Seeking recommendations for forex economic data API | /r/algotrading | 2023-05-03

    I've looked at https://github.com/pydata/pandas-datareader and it looks good, does anyone have experience?

  • sweetviz

    Visualize and compare datasets, target values and associations, with one line of code.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-26.

Index

What are some of the best open-source Data Analysis projects in Python? This list will help you:

#   Project             Stars
1   scikit-learn        57,481
2   Pandas              41,332
3   streamlit           30,359
4   gradio              26,638
5   best-of-ml-python   15,098
6   airbyte             13,265
7   ydata-profiling     11,837
8   statsmodels          9,331
9   mlcourse.ai          9,308
10  pygwalker            8,937
11  akshare              7,979
12  cleanlab             7,947
13  pyod                 7,824
14  imbalanced-learn     6,642
15  knowledge-repo       5,417
16  Resume-Matcher       4,308
17  missingno            3,771
18  plotnine             3,748
19  AWS Data Wrangler    3,745
20  running_page         3,078
21  igel                 3,078
22  pandas-datareader    2,791
23  sweetviz             2,789