Data

Top 23 Data Open-Source Projects

  1. TanStack Query

    🤖 Powerful asynchronous state management, server-state utilities and data fetching for the web. TS/JS, React Query, Solid Query, Svelte Query and Vue Query.

    Project mention: How To Fetch The Data From API In React JS | dev.to | 2025-03-17

    React Query Documentation

  3. Metabase

    The easy-to-use open source Business Intelligence and Embedded Analytics tool that lets everyone work with data 📊

    Project mention: Metabase 52: Sankey charts, faster search, and more | dev.to | 2024-12-11

    Hope you enjoy the release. If you want to get into the nitty-gritty, check out our release notes in GitHub. To see what other features we have in the works, see our product roadmap.

  4. llama_index

    LlamaIndex is the leading framework for building LLM-powered agents over your data.

    Project mention: Quick tip: Replace MongoDB® Atlas with SingleStore Kai in LlamaIndex | dev.to | 2025-01-21

    The notebook is adapted from the LlamaIndex GitHub repo.

  5. SheetJS js-xlsx

    📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs

    Project mention: Building an inventory management app: 'Invento' as a Beginner Developer | dev.to | 2024-07-24

    XLSX : XLSX is a library for parsing and writing Excel spreadsheet files. It enables the application to export data to Excel, which is a common requirement for inventory management systems.

  6. firecrawl

    🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

    Project mention: Show HN: Get structured website data with just a prompt | news.ycombinator.com | 2025-01-20

    - Also, most of our work including /extract is open-source. Check it out here at https://github.com/mendableai/firecrawl

    That's all for now! Let us know any feedback on /extract.

  7. SWR

    React Hooks for Data Fetching

    Project mention: Using React Suspense with Data Fetching and Concurrent Rendering. | dev.to | 2025-02-12

    If you need more robust features—like caching, re-fetching, or pagination—check out libraries like React Query or SWR. Many of these tools also offer optional Suspense modes, so you can keep your data-fetching code clean while benefiting from powerful caching and refetching mechanisms.

  8. data-engineer-handbook

    This is a repo with links to everything you'd ever want to learn about data engineering

    Project mention: Data-engineer-handbook: everything to learn about data engineering | news.ycombinator.com | 2024-12-03

    This thing points to some sort of github metrics dashboard.

    The actual handbook is at: https://github.com/DataExpert-io/data-engineer-handbook

  10. Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: Show HN: Flow – A Dynamic Task Engine for AI Agents Without DAG | news.ycombinator.com | 2024-12-02

    - https://github.com/PrefectHQ/prefect

  11. pandas-ai

    Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.

    Project mention: Discover the Future: Trending GitHub Projects Revolutionizing Tech 🌟 | dev.to | 2025-02-24

    Stars: 16113 | Author: sinaptik-ai

  12. airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Project mention: Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead | dev.to | 2025-01-27

    Whenever we discuss event streaming, Kafka inevitably enters the conversation. As the de facto standard for event streaming, Kafka is widely used as a data pipeline to move data between systems. However, Kafka is not the only tool capable of facilitating data movement. Products like Fivetran, Airbyte, and other SaaS offerings provide user-friendly tools for data ingestion, expanding the options available to engineers.

  13. data

    Data and code behind the articles and graphics at FiveThirtyEight

    Project mention: The Birthday Paradox Experiment | news.ycombinator.com | 2024-11-22

    US birth by day is available here:

    https://github.com/fivethirtyeight/data/tree/master/births

    For the period 2000-2014 the number of births by month, divided by the number of days in a month, is:

       1  31   5 072 588.00     163 631.87

  14. Presto

    The official home of the Presto distributed SQL query engine for big data

    Project mention: Using IRIS and Presto for high-performance and scalable SQL queries | dev.to | 2025-01-19

    The rise of Big Data projects, real-time self-service analytics, online query services, and social networks, among others, has enabled scenarios for massive and high-performance data queries. In response to this challenge, MPP (massively parallel processing) database technology was created, and it quickly established itself. Among the open-source MPP options, Presto (https://prestodb.io/) is the best known. It originated at Facebook, where it was used for data analytics, and was later open-sourced. Since joining the Presto community, Teradata now offers support for it.

  15. faker

    Generate massive amounts of fake data in the browser and node.js (by faker-js)

    Project mention: German Devs Strip Faker.js Creator from License – I'm Suing | news.ycombinator.com | 2025-02-22
  16. awesome-bigdata

    A curated list of awesome big data frameworks, resources and other awesomeness.

    Project mention: Top 20 Awesome on Github | dev.to | 2024-06-12

    12. Awesome Big Data

  17. chinese-xinhua

    📙 Chinese Xinhua Dictionary database, including xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.

  18. akshare

    AKShare is an elegant and simple open-source financial data interface library for Python, built for human beings! (by akfamily)

  19. pkl

    A configuration as code language with rich validation and tooling.

    Project mention: JSON5 – JSON for Humans | news.ycombinator.com | 2024-12-08

    When I manage a project and have the freedom to choose my configuration structure, I always use TypeScript. I never understood the desire to have configuration be in ini/json/jsonnet/yaml. A strongly typed configuration with code completion seems so much more robust. Unless, of course, your use case is to load or change the config via an API.

    I like what Apple is doing with https://pkl-lang.org/ though.

  20. prql

    PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

    Project mention: SQL queries don't start with SELECT (2019) | news.ycombinator.com | 2025-03-14
  21. Bogus

    📇 A simple fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

    Project mention: Iterations | dev.to | 2024-12-22

    Code which uses Bogus NuGet package for random data.

  22. semantic-source

    Parsing, analyzing, and comparing source code across many languages

    Project mention: Nuanced: As AI writes more code, we need better tools to trust it | news.ycombinator.com | 2025-01-08

    At Nuanced, we're building tools that make AI-generated code more reliable.

    As AI writes more code, we need better tools to trust it and technologies that ensure our human understanding keeps pace with this rapid development.

    While everyone else races to ship new features with AI, we're focused on addressing gaps in AI coding tools and ensuring those features are reliable and maintainable rather than code that works today but becomes a liability tomorrow.

    We're starting with an AI-powered Python language server that makes AI-generated code more reliable by understanding your entire system—using a deeper semantic understanding of code than LLMs have today, but also artifacts outside of code such as commit histories, configs, and team patterns.

    We're a team of ex-GitHub engineers and researchers who've scaled some of the world's largest developer platforms. I'm Ayman (https://www.aymannadeem.com/about/), and before founding Nuanced, I spent seven years at GitHub where I helped build Semantic (https://github.com/github/semantic), an open-source library for parsing and analyzing code across languages, and scaled security systems to detect anomalous code patterns across millions of repositories. Our team's deep experience in static analysis and large-scale system design shapes our approach to the AI reliability challenge today.

    We've all been on-call at 2 AM, untangling complex service dependencies, and more recently, we've seen firsthand how AI accelerates development—both the wins and the wounds.

    If you're building an AI coding tool and any of this sounds interesting to you—we should talk!

    Read more at https://nuanced.dev/blog/the-reliability-gap

  23. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: Wk 3 Orchestration: MLOPs with DataTalks | dev.to | 2025-02-22

    Here, we use the free Mage AI orchestration tool.

  24. machine-learning-roadmap

    A roadmap connecting many of the most important concepts in machine learning, how to learn them and what tools to use to perform them.

  25. Tabulator

    Interactive Tables and Data Grids for JavaScript

    Project mention: Svelte Data Tables for 2024: A Comprehensive Feature Comparison | dev.to | 2024-09-25

    Tabulator

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
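
The per-day figure quoted in the excerpt under the FiveThirtyEight `data` entry above can be checked with a short calculation. This is a minimal sketch, assuming the quoted row reads as month number, days in the month, total births for 2000-2014, and the quotient of the last two. Those column meanings are inferred from the excerpt, not stated in it, and the variable names are illustrative.

```python
# Values copied from the quoted row: "1  31   5 072 588.00     163 631.87".
# Assumed reading: month 1 (January), 31 days, 5,072,588 births in total.
days_in_month = 31
total_births = 5_072_588

# "Number of births by month, divided by the number of days in a month."
births_per_calendar_day = total_births / days_in_month

print(f"{births_per_calendar_day:,.2f}")
```

Running it prints 163,631.87, which matches the last value in the quoted row under that reading.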

Data related posts

  • DVC Extension for Visual Studio Code

    1 project | news.ycombinator.com | 20 Mar 2025
  • Show HN: Open-Source Structured Extraction with LLM Using Ollama

    1 project | news.ycombinator.com | 19 Mar 2025
  • Revolutionizing Data Apps: Build Interactive Dashboards with Just Python!

    1 project | dev.to | 19 Mar 2025
  • cue VS rcl - a user suggested alternative

    2 projects | 15 Mar 2025
  • Data Indexing and Common Challenges

    1 project | dev.to | 14 Mar 2025
  • [Open Source Project] Fresh Data For AI

    1 project | dev.to | 10 Mar 2025
  • Show HN: Fresh Data for AI

    2 projects | news.ycombinator.com | 9 Mar 2025

Index

What are some of the best open-source Data projects? This list will help you:

 #  Project                    Stars
 1  TanStack Query            44,204
 2  Metabase                  41,301
 3  llama_index               40,107
 4  SheetJS js-xlsx           35,488
 5  firecrawl                 31,797
 6  SWR                       31,226
 7  data-engineer-handbook    27,078
 8  Prefect                   18,623
 9  pandas-ai                 18,092
10  airbyte                   17,524
11  data                      17,001
12  Presto                    16,272
13  faker                     13,591
14  awesome-bigdata           13,504
15  chinese-xinhua            10,991
16  akshare                   11,021
17  pkl                       10,538
18  prql                      10,139
19  Bogus                      9,142
20  semantic-source            9,024
21  Mage                       8,196
22  machine-learning-roadmap   7,625
23  Tabulator                  7,014
