Data Analysis

Top 23 Data Analysis Open-Source Projects

  • superset

    Apache Superset is a Data Visualization and Data Exploration Platform

  • Project mention: Apache Superset | news.ycombinator.com | 2024-02-26

    Superset is absolutely phenomenal. I really hope Microsoft eventually releases all of the customizations they made to it internally to the open-source community.

    https://www.youtube.com/watch?v=RY0SSvSUkMA

    https://github.com/apache/superset/discussions/20094

  • scikit-learn

    scikit-learn: machine learning in Python

  • Project mention: AutoCodeRover resolves 22% of real-world GitHub in SWE-bench lite | news.ycombinator.com | 2024-04-09

    Thank you for your interest. There are some interesting examples in the SWE-bench-lite benchmark which are resolved by AutoCodeRover:

    - From sympy: https://github.com/sympy/sympy/issues/13643. AutoCodeRover's patch for it: https://github.com/nus-apr/auto-code-rover/blob/main/results...

    - Another one from scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/13070. AutoCodeRover's patch (https://github.com/nus-apr/auto-code-rover/blob/main/results...) modified a few lines below (compared to the developer patch) and wrote a different comment.

    There are more examples in the results directory (https://github.com/nus-apr/auto-code-rover/tree/main/results).

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

  • Project mention: Deploying a Serverless Dash App with AWS SAM and Lambda | dev.to | 2024-03-04

    Dash is a Python framework that enables you to build interactive frontend applications without writing a single line of Javascript. Internally and in projects we like to use it in order to build a quick proof of concept for data driven applications because of the nice integration with Plotly and pandas. For this post, I'm going to assume that you're already familiar with Dash and won't explain that part in detail. Instead, we'll focus on what's necessary to make it run serverless.

  • Metabase

    The simplest, fastest way to get business intelligence and analytics to everyone in your company 😋

  • Project mention: HackTheBox - Writeup Analytics | dev.to | 2024-03-30

    Remote Code Execution via H2

  • streamlit

    Streamlit — A faster way to build and share data apps.

  • Project mention: Creating a Sales Analysis Application with Streamlit: A Practical Approach to Business Intelligence | dev.to | 2024-04-19

    2. Go to https://streamlit.io, log in, and create a new app from your GitHub repository.

  • gradio

    Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

  • Project mention: Show HN: Dropbase – Build internal web apps with just Python | news.ycombinator.com | 2023-12-05

    There's also that library all the AI models started using that gives you a public URL to share. After researching it: https://www.gradio.app/ is the link.

    It's used specifically for making simple UIs for machine learning apps. But I guess technically you could use it for anything.

  • AI-Expert-Roadmap

    Roadmap to becoming an Artificial Intelligence Expert in 2022

  • Project mention: Best AI ML DL DS Roadmap | /r/deeplearning | 2023-12-07

    I.am.ai AI Expert Roadmap (https://i.am.ai/roadmap): This roadmap focuses more on AI and includes various aspects of machine learning and deep learning. It's suitable for those who want to delve deeper into AI, particularly in cutting-edge research and applications.

  • Data-Science-For-Beginners

    10 Weeks, 20 Lessons, Data Science for All!

  • Project mention: Welcome to 14 days of Data Science! | dev.to | 2024-03-07

    Get started with Data Science in the Data Science for Beginners curriculum.

  • CyberChef

    The Cyber Swiss Army Knife - a web app for encryption, encoding, compression and data analysis

  • Project mention: PicoCTF 2024: packer | dev.to | 2024-04-05

    Then we take the encrypted text and use CyberChef to decrypt it.

  • GoAccess

    GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.

  • Project mention: You don't need analytics on your blog | news.ycombinator.com | 2023-12-24

    If one wants server-side metrics with a little more info than the author's "hacky little script", there's always goaccess [1], which functions in broadly the same way. I even use it with Firebase Hosting-hosted sites via [2] (which I wrote).

    [1] http://goaccess.io/

    [2] https://github.com/Silicon-Ally/gcp-clf
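
    To make "server-side metrics" concrete, here is a minimal Python sketch of the kind of summary a web log analyzer produces: request counts per path and per status code, read from Common Log Format lines. It is only an illustration of the idea, not how GoAccess itself is implemented; the `summarize` helper, the regular expression, and the sample log lines are all assumptions made up for this example.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path protocol" status size
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize(lines):
    """Count requests per path and per status code, skipping unparseable lines."""
    paths, statuses = Counter(), Counter()
    for line in lines:
        match = CLF_PATTERN.match(line)
        if match is None:
            continue  # ignore anything that is not a well-formed log entry
        paths[match["path"]] += 1
        statuses[match["status"]] += 1
    return paths, statuses


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [24/Dec/2023:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 5120',
    '203.0.113.9 - - [24/Dec/2023:10:00:05 +0000] "GET /blog/ HTTP/1.1" 200 5120',
    '198.51.100.7 - - [24/Dec/2023:10:01:12 +0000] "GET /missing HTTP/1.1" 404 312',
    'not a log line',
]

paths, statuses = summarize(SAMPLE_LOG)
print(paths.most_common(1))
print(dict(statuses))
```

    In practice the lines would come from the server's access log file rather than a hard-coded list, and a real analyzer also tracks visitors, referrers, and bandwidth.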

  • best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

  • Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

    I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.

    It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.

  • ydata-profiling

    1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

  • Project mention: FLaNK 25 December 2023 | dev.to | 2023-12-26
  • OpenRefine

    OpenRefine is a free, open source power tool for working with messy data and improving it

  • Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07

    "OpenRefine is a powerful free, open source tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data." https://openrefine.org/

  • pandas_exercises

    Practice your pandas skills!

  • pygwalker

    PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis

  • Project mention: Show HN: Use an "eraser" to clean data on flight without breaking your workflow | news.ycombinator.com | 2024-03-15
  • statsmodels

    Statsmodels: statistical modeling and econometrics in Python

  • Project mention: statsmodels Release Candidate 0.14.0rc0 tagged | /r/Python | 2023-04-26
  • mlcourse.ai

    Open Machine Learning Course

  • Project mention: Open Machine Learning Course | news.ycombinator.com | 2023-10-22
  • cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

  • Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05

    We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! 开源财经数据接口库 (by akfamily)

  • pyod

    A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)

  • Project mention: A Comprehensive Guide for Building Rag-Based LLM Applications | news.ycombinator.com | 2023-09-13

    This is a feature in many commercial products already, as well as open source libraries like PyOD. https://github.com/yzhao062/pyod

  • cudf

    cuDF - GPU DataFrame Library

  • Project mention: A Polars exploration into Kedro | dev.to | 2023-05-17

    The interesting thing about Polars is that it does not try to be a drop-in replacement to pandas, like Dask, cuDF, or Modin, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.

  • gonum

    Gonum is a set of numeric libraries for the Go programming language. It contains libraries for matrices, statistics, optimization, and more

  • Project mention: How to set up interface to accept multi-dimension array? | /r/golang | 2023-07-13

    But if you want to see what can be done for numeric stuff, check out gonum. Personally, I still wouldn't use Go, and I rather suspect it's still pretty easy to reach for something like what you're trying to do and not find it because Go just can't write that type sensibly, but you can at least see what is available, written by people who disagree with me about Go not being a great language for this.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Data Analysis projects? This list will help you:

 #   Project                       Stars
 1   superset                     58,737
 2   scikit-learn                 58,046
 3   Pandas                       41,923
 4   Metabase                     36,417
 5   streamlit                    31,506
 6   gradio                       28,556
 7   AI-Expert-Roadmap            28,388
 8   Data-Science-For-Beginners   26,290
 9   CyberChef                    25,384
10   GoAccess                     17,467
11   best-of-ml-python            15,302
12   airbyte                      13,923
13   ydata-profiling              12,022
14   OpenRefine                   10,448
15   pandas_exercises             10,159
16   pygwalker                     9,759
17   statsmodels                   9,534
18   mlcourse.ai                   9,390
19   cleanlab                      8,592
20   akshare                       8,321
21   pyod                          7,928
22   cudf                          7,257
23   gonum                         7,249
