Top 23 Data Open-Source Projects
-
TanStack Query
🤖 Powerful asynchronous state management, server-state utilities and data fetching for the web. TS/JS, React Query, Solid Query, Svelte Query and Vue Query.
React Query Documentation
-
Metabase
The easy-to-use open source Business Intelligence and Embedded Analytics tool that lets everyone work with data 📊
Hope you enjoy the release. If you want to get into the nitty-gritty, check out our release notes in GitHub. To see what other features we have in the works, see our product roadmap.
-
llama_index
Project mention: Quick tip: Replace MongoDB® Atlas with SingleStore Kai in LlamaIndex | dev.to | 2025-01-21
The notebook is adapted from the LlamaIndex GitHub repo.
-
SheetJS js-xlsx
📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs
Project mention: Building an inventory management app: 'Invento' as a Beginner Developer | dev.to | 2024-07-24
XLSX: XLSX is a library for parsing and writing Excel spreadsheet files. It enables the application to export data to Excel, which is a common requirement for inventory management systems.
-
firecrawl
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Project mention: Show HN: Get structured website data with just a prompt | news.ycombinator.com | 2025-01-20
- Also, most of our work including /extract is open-source. Check it out here at https://github.com/mendableai/firecrawl
That's all for now! Let us know any feedback on /extract.
-
SWR
Project mention: Using React Suspense with Data Fetching and Concurrent Rendering. | dev.to | 2025-02-12
If you need more robust features—like caching, re-fetching, or pagination—check out libraries like React Query or SWR. Many of these tools also offer optional Suspense modes, so you can keep your data-fetching code clean while benefiting from powerful caching and refetching mechanisms.
-
data-engineer-handbook
This is a repo with links to everything you'd ever want to learn about data engineering
Project mention: Data-engineer-handbook: everything to learn about data engineering | news.ycombinator.com | 2024-12-03
This thing points to some sort of GitHub metrics dashboard.
The actual handbook is at: https://github.com/DataExpert-io/data-engineer-handbook
-
Prefect
Project mention: Show HN: Flow – A Dynamic Task Engine for AI Agents Without DAG | news.ycombinator.com | 2024-12-02
- https://github.com/PrefectHQ/prefect
-
pandas-ai
Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.
Project mention: Discover the Future: Trending GitHub Projects Revolutionizing Tech 🌟 | dev.to | 2025-02-24
Stars: 16113 Author: sinaptik-ai Star the pandas-ai repository⭐
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Project mention: Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead | dev.to | 2025-01-27
Whenever we discuss event streaming, Kafka inevitably enters the conversation. As the de facto standard for event streaming, Kafka is widely used as a data pipeline to move data between systems. However, Kafka is not the only tool capable of facilitating data movement. Products like Fivetran, Airbyte, and other SaaS offerings provide user-friendly tools for data ingestion, expanding the options available to engineers.
-
data
US births by day are available here:
https://github.com/fivethirtyeight/data/tree/master/births
For the period 2000-2014, the number of births by month, divided by the number of days in the month, is:
Month | Days | Births | Births per day |
---|---|---|---|
1 | 31 | 5,072,588.00 | 163,631.87 |
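As a quick check of the row above, here is a minimal sketch of the division the comment describes. It assumes the row reads as month 1, 31 days, 5,072,588 births, and a per-day figure; the variable names are illustrative and not taken from the original post.

```python
# Values read from the row above.
# Assumed column layout: month, days in month, births, births per day.
month = 1
days_in_month = 31
total_births = 5_072_588  # births recorded for this month over 2000-2014

# Divide the monthly total by the number of days in the month.
births_per_day = total_births / days_in_month

print(f"{month} {days_in_month} {total_births:,} {births_per_day:,.2f}")
# prints: 1 31 5,072,588 163,631.87
```

The result matches the last figure in the row, which supports reading the four numbers as month, days, births, and births per day.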
-
Presto
Project mention: Using IRIS and Presto for high-performance and scalable SQL queries | dev.to | 2025-01-19
The rise of Big Data projects, real-time self-service analytics, online query services, and social networks, among others, has created scenarios that demand massive, high-performance data queries. MPP (massively parallel processing) database technology was created in response to this challenge, and it quickly established itself. Among the open-source MPP options, Presto (https://prestodb.io/) is the best known. It originated at Facebook, where it was used for data analytics, and was later open-sourced. Teradata has since joined the Presto community and now offers support.
-
faker
Project mention: German Devs Strip Faker.js Creator from License – I'm Suing | news.ycombinator.com | 2025-02-22
-
12. Awesome Big Data
-
-
pkl
When I manage a project and have the freedom to choose my configuration structure, I always use TypeScript. I never understood the desire to keep configuration in ini/json/jsonnet/yaml. A strongly typed configuration with code completion seems so much more robust. The exception, of course, is when your use case is to load or change the config via an API.
I like what Apple is doing with https://pkl-lang.org/ though.
-
prql
PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
-
Bogus
📇 A simple fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.
Code which uses Bogus NuGet package for random data.
-
Project mention: Nuanced: As AI writes more code, we need better tools to trust it | news.ycombinator.com | 2025-01-08
At Nuanced, we're building tools that make AI-generated code more reliable.
As AI writes more code, we need better tools to trust it and technologies that ensure our human understanding keeps pace with this rapid development.
While everyone else races to ship new features with AI, we're focused on addressing gaps in AI coding tools and ensuring those features are reliable and maintainable rather than code that works today but becomes a liability tomorrow.
We're starting with an AI-powered Python language server that makes AI-generated code more reliable by understanding your entire system—using a deeper semantic understanding of code than LLMs have today, but also artifacts outside of code such as commit histories, configs, and team patterns.
We're a team of ex-GitHub engineers and researchers who've scaled some of the world's largest developer platforms. I'm Ayman (https://www.aymannadeem.com/about/), and before founding Nuanced, I spent seven years at GitHub where I helped build Semantic(https://github.com/github/semantic), an open-source library for parsing and analyzing code across languages—and scaled security systems to detect anomalous code patterns across millions of repositories. Our team’s deep experience in static analysis and large-scale system design shapes our approach to the AI reliability challenge today.
We've all been on-call at 2 AM, untangling complex service dependencies, and more recently, we've seen firsthand how AI accelerates development—both the wins and the wounds.
If you're building an AI coding tool and any of this sounds interesting to you—we should talk!
Read more at https://nuanced.dev/blog/the-reliability-gap
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Here, we use the free Mage AI orchestration tool.
-
machine-learning-roadmap
A roadmap connecting many of the most important concepts in machine learning, how to learn them and what tools to use to perform them.
-
Project mention: Svelte Data Tables for 2024: A Comprehensive Feature Comparison | dev.to | 2024-09-25
Tabulator
Data discussion
Data related posts
-
DVC Extension for Visual Studio Code
-
Show HN: Open-Source Structured Extraction with LLM Using Ollama
-
Revolutionizing Data Apps: Build Interactive Dashboards with Just Python!
-
cue VS rcl - a user suggested alternative
2 projects | 15 Mar 2025
-
Data Indexing and Common Challenges
-
[Open Source Project] Fresh Data For AI
-
Show HN: Fresh Data for AI
Index
What are some of the best open-source Data projects? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | TanStack Query | 44,204 |
| 2 | Metabase | 41,301 |
| 3 | llama_index | 40,107 |
| 4 | SheetJS js-xlsx | 35,488 |
| 5 | firecrawl | 31,797 |
| 6 | SWR | 31,226 |
| 7 | data-engineer-handbook | 27,078 |
| 8 | Prefect | 18,623 |
| 9 | pandas-ai | 18,092 |
| 10 | airbyte | 17,524 |
| 11 | data | 17,001 |
| 12 | Presto | 16,272 |
| 13 | faker | 13,591 |
| 14 | awesome-bigdata | 13,504 |
| 15 | chinese-xinhua | 10,991 |
| 16 | akshare | 11,021 |
| 17 | pkl | 10,538 |
| 18 | prql | 10,139 |
| 19 | Bogus | 9,142 |
| 20 | semantic-source | 9,024 |
| 21 | Mage | 8,196 |
| 22 | machine-learning-roadmap | 7,625 |
| 23 | Tabulator | 7,014 |