Top 23 Data Open-Source Projects

  • TanStack Query

    🤖 Powerful asynchronous state management, server-state utilities and data fetching for the web. TS/JS, React Query, Solid Query, Svelte Query and Vue Query.

  • Project mention: Best Next.js Libraries and Tools in 2024 | 2024-04-10
  • Metabase

    The simplest, fastest way to get business intelligence and analytics to everyone in your company

  • Project mention: HackTheBox - Writeup Analytics | 2024-03-30

    Remote Code Execution via H2

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

  • SheetJS js-xlsx

    📗 SheetJS Spreadsheet Data Toolkit -- New home

  • Project mention: how to work with .xlsx files? | /r/node | 2023-06-28

    ExcelJS and XLSX (SheetJS) are great libraries to work with XLSX files. The former I've found a bit easier to work with but less efficient in general.

  • llama_index

    LlamaIndex is a data framework for your LLM applications

  • Project mention: LlamaIndex: A data framework for your LLM applications | 2024-04-07
  • SWR

    React Hooks for Data Fetching

  • Project mention: Think Twice Before Using setInterval() for API Polling – It Might Not Be Ideal | 2024-05-19

    Data fetching is a very simple task until it becomes complicated. Therefore I recommend that you use a powerful data fetching library such as TanStack Query, SWR, or RTK Query.

  • data

    Data and code behind the articles and graphics at FiveThirtyEight

  • Project mention: Mastering Dataset Acquisition: A Comprehensive Guide | 2024-05-03

    FiveThirtyEight Datasets: Datasets related to articles and investigations published by FiveThirtyEight.

  • Presto

    The official home of the Presto distributed SQL query engine for big data

  • Project mention: Multi-Database Support in DuckDB | 2024-01-28

    We have some of this functionality in Presto, but it takes a fair bit of work to implement it for all the different backends.

  • SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

  • Project mention: Prefect: A workflow orchestration tool for data pipelines | 2024-03-13
  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

  • Project mention: How to Build a Chat App with Your Postgres Data using Agent Cloud | 2024-05-13

    AgentCloud uses Airbyte to build data pipelines, which allow us to split, chunk, and embed data from over 300 data sources, including Postgres.

  • awesome-bigdata

    A curated list of awesome big data frameworks, resources and other awesomeness.

  • Project mention: Good coding groups for black women? | 2024-01-13
  • faker

    Generate massive amounts of fake data in the browser and node.js (by faker-js)

  • Project mention: Easily create mock data for unit tests 🧪 | 2024-02-15

    Instead of manually having to think of defaults for your interface properties, you could use Faker.

  • pandas-ai

    Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.

  • Project mention: PandasAI is great but is there a more general library? | 2023-08-23
  • chinese-xinhua

    Chinese Xinhua Dictionary database, including xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.

  • prql

    PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

  • Project mention: Prolog language for PostgreSQL proof of concept | 2024-03-30
  • semantic-source

    Parsing, analyzing, and comparing source code across many languages

  • Project mention: The Meaning of Monad in MonadTrans | 2023-08-13

    One production example I know: GitHub code navigation is written in Haskell

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

  • Bogus

    A simple fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

  • Project mention: Bogus custom Dataset | 2024-01-06

    The Bogus NuGet package is a fake data generator that can help populate database tables and support testing. If no database is used and Bogus populates a list of data each time the application runs, the data is random and never the same. The random data generated by Bogus may also not meet a developer's requirements.
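
    The point about random data never being the same between runs can be illustrated with a small sketch. This is not Bogus's API: it is a Python stand-in, and `Person`, `make_people`, and the name lists are hypothetical names invented for the illustration. It assumes the generator can be driven by a fixed seed, in which case the same seed yields the same records on every run.

```python
import random
from dataclasses import dataclass

# Hypothetical sample values, used only for this illustration.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret", "Alan"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton", "Turing"]


@dataclass(frozen=True)
class Person:
    first_name: str
    last_name: str
    age: int


def make_people(count: int, seed: int) -> list[Person]:
    """Return `count` fake people; the same seed always gives the same list."""
    # A private generator keeps the result independent of global random state.
    rng = random.Random(seed)
    return [
        Person(
            first_name=rng.choice(FIRST_NAMES),
            last_name=rng.choice(LAST_NAMES),
            age=rng.randint(18, 90),
        )
        for _ in range(count)
    ]


if __name__ == "__main__":
    first_run = make_people(3, seed=42)
    second_run = make_people(3, seed=42)
    print(first_run == second_run)  # same seed, same data
```

    Restricting the value pools, as the name lists do here, is also one way to keep generated data within a developer's requirements.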

  • machine-learning-roadmap

    A roadmap connecting many of the most important concepts in machine learning, how to learn them and what tools to use to perform them.

  • Project mention: Best AI ML DL DS Roadmap | /r/deeplearning | 2023-12-07

    Mrdbourke/machine-learning-roadmap on GitHub: This GitHub repository is more focused on machine learning. It's a good choice if you're looking for a more community-driven approach, as GitHub repositories often encourage contributions and updates from various experts.

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data.

  • Project mention: FLaNK AI-April 22, 2024 | 2024-04-22
  • Snowplow

    The enterprise-grade behavioral data engine (web, mobile, server-side, webhooks), running cloud-natively on AWS and GCP

  • kestra

    Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

  • Project mention: A High-Performance, Java-Based Orchestration Platform | /r/java | 2023-10-11

    Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options. For those interested, the GitHub repository is available here:

  • Tabulator

    Interactive Tables and Data Grids for JavaScript

  • Project mention: Tabulator – JavaScript Tables and Data Grids | 2024-02-09
  • Parsr

    Transforms PDF, Documents and Images into Enriched Structured Data

  • Project mention: LlamaCloud and LlamaParse | 2024-02-20

    I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->Structured Text extractors (I built several in the past, including

    For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), then uses a mixture of heuristics and machine learning models to reconstruct the document.

    Once plugged into a recursive retrieval strategy, it allows you to get SOTA results on question answering over complex text (see notebook:


NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Data related posts

  • Firecrawl: Turn entire websites into LLM-ready Markdown or structured data

    1 project | 18 May 2024
  • Ask HN: Anyone dislike GitHub action yml syntax

    1 project | 17 May 2024
  • Really: Policy language for infra that doesn't suck

    1 project | 16 May 2024
  • How to Build a Chat App with Your Postgres Data using Agent Cloud

    3 projects | 13 May 2024
  • Pg_lakehouse: Query Any Data Lake from Postgres

    1 project | 12 May 2024
  • A Language Without Comments

    1 project | 12 May 2024
  • Mastering Dataset Acquisition: A Comprehensive Guide

    2 projects | 3 May 2024
  • A note from our sponsor - InfluxDB | 20 May 2024


What are some of the best open-source Data projects? This list will help you:

# Project Stars
1 TanStack Query 40,102
2 Metabase 36,784
3 SheetJS js-xlsx 34,554
4 llama_index 31,628
5 SWR 29,562
6 data 16,649
7 Presto 15,624
8 Prefect 14,829
9 airbyte 14,296
10 awesome-bigdata 12,845
11 faker 11,857
12 pandas-ai 11,214
13 chinese-xinhua 10,682
14 prql 9,470
15 semantic-source 8,879
16 akshare 8,479
17 Bogus 8,366
18 machine-learning-roadmap 7,164
19 Mage 7,131
20 Snowplow 6,751
21 kestra 6,605
22 Tabulator 6,237
23 Parsr 5,662
