Top 23 data-engineering Open-Source Projects
-
data-engineering-zoomcamp
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
Project mention: Study Notes 2.2.7: Managing Schedules and Backfills with BigQuery in Kestra | dev.to | 2025-02-04
DE Zoomcamp Resources: Data Engineering Zoomcamp
-
applied-ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
-
Project mention: Show HN: Flow – A Dynamic Task Engine for AI Agents Without DAG | news.ycombinator.com | 2024-12-02
- https://github.com/PrefectHQ/prefect
-
Taipy is a Python web framework for data-driven applications. It lets developers build web applications using only Python, which is especially useful for data scientists and analytics professionals. The project is gaining popularity in the field: at the time of writing this post, it had 1.9k forks and 17.6k stars.
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Project mention: Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead | dev.to | 2025-01-27
Whenever we discuss event streaming, Kafka inevitably enters the conversation. As the de facto standard for event streaming, Kafka is widely used as a data pipeline to move data between systems. However, Kafka is not the only tool capable of facilitating data movement. Products like Fivetran, Airbyte, and other SaaS offerings provide user-friendly tools for data ingestion, expanding the options available to engineers.
-
Project mention: Data on Kubernetes: Part 4 - Argo Workflows: Simplify parallel jobs : Container-native workflow engine for Kubernetes 🔮 | dev.to | 2024-07-28
Remember to meet the prerequisites, including the AWS CLI, kubectl, Terraform, and the Argo Workflow CLI.
-
Data orchestration tools are key to managing data pipelines in modern workflows. Apache Airflow, Dagster, and Flyte are popular choices, but they serve different purposes and follow different philosophies. Choosing the right tool for your requirements is essential for scalability and efficiency. In this blog, I will compare Apache Airflow, Dagster, and Flyte, exploring their evolution, features, and unique strengths, while sharing insights from my hands-on experience with these tools in a weather data pipeline project.
-
Project mention: Kuvasz-streamer: open-source CDC for Postgres for low latency replication | news.ycombinator.com | 2025-01-03
lots of Go based CDC stuff going on these days. Redpanda Connect (formerly benthos) recently added support for Postgres CDC [1] and MySQL is coming soon too [2].
1: https://github.com/redpanda-data/connect/pull/2917
2: https://github.com/redpanda-data/connect/pull/3014
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Here, we use the free Mage AI orchestration tool.
-
Project mention: Simplifying SQL function implementation with Rust procedural macro | dev.to | 2025-03-13
Then, use declarative macros to generate the various kernel functions, covering functions with 1, 2, and 3 parameters as well as the input/output combinations of T and Option<T>. Common kernels such as unary, binary, ternary, unary_nullable and unary_bytes are generated this way, which partially addresses the last two issues. (For the implementation details, see RisingWave's earlier code.) In theory, type exercise could also be used here: for example, introducing a trait to unify (A,), (A, B) and (A, B, C), or using the Into and AsRef traits to unify T, Option<T>, and Result<T>. However, you will probably run into some type challenges posed by rustc.
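The macro approach described in this mention can be sketched roughly as follows. This is only an illustration, not RisingWave's actual code: the macro name `gen_kernel`, the slice-of-`Option` column representation, and the null-propagation rule (any `None` input yields a `None` output) are all assumptions made for the example.

```rust
// Hypothetical sketch: one declarative macro that stamps out kernels of
// different arity. Columns are modelled as slices of Option<T> (an assumption).
macro_rules! gen_kernel {
    ($name:ident, $first:ident : $fty:ident $(, $arg:ident : $ty:ident)*) => {
        fn $name<$fty: Copy, $($ty: Copy,)* O>(
            $first: &[Option<$fty>],
            $($arg: &[Option<$ty>],)*
            f: impl Fn($fty $(, $ty)*) -> O,
        ) -> Vec<Option<O>> {
            // Every input column must have the same number of rows.
            $(assert_eq!($first.len(), $arg.len(), "columns must have equal length");)*
            (0..$first.len())
                .map(|i| -> Option<O> {
                    // `?` short-circuits the row to None if any input is None.
                    let $first = $first[i]?;
                    $(let $arg = $arg[i]?;)*
                    Some(f($first $(, $arg)*))
                })
                .collect()
        }
    };
}

// Generate kernels for 1, 2, and 3 input columns.
gen_kernel!(unary, a: A);
gen_kernel!(binary, a: A, b: B);
gen_kernel!(ternary, a: A, b: B, c: C);
```

With these definitions, a call such as `binary(&[Some(1), None], &[Some(10), Some(20)], |x: i32, y: i32| x + y)` applies the closure row by row and returns `None` for any row with a missing input. Kernels with other nullability rules (for example, a function that itself returns `Option`) would need additional macro arms.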
-
Project mention: 10 Must-Know Open Source Platform Engineering Tools for AI/ML Workflows | dev.to | 2025-02-06
Unlike the other tools, Feast solves a different issue: the management of ML feature data. Feast simplifies the features management by storing and managing the code used to generate machine learning features, and facilitates the deployment of these features into production. Typically, it integrates with your data sources to streamline management.
-
evidence
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
Project mention: How to Build an Internal Data Application Using Google Sheets as a Data Source | dev.to | 2025-03-13
Evidence: https://evidence.dev/
-
sql-translator
SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.
data-engineering related posts
-
The DOJ Still Wants Google to Sell Off Chrome
-
Hybrid in-memory and disk cache in Rust
-
Study Notes 2.2.7: Managing Schedules and Backfills with BigQuery in Kestra
-
Study Note DE Zoomcamp 1.2.4 - Dockerizing the Ingestion Script
-
Show HN: Desbordante 2.3.0 is out, now supports macOS
-
Start contributing to a Popular Open Source Project
-
Data Engineering Zoomcamp 2025 Cohort: Introduction - Self-Study Notes
-
Index
What are some of the best open-source data-engineering projects? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | superset | 64,973 |
| 2 | Airflow | 39,237 |
| 3 | Made-With-ML | 38,312 |
| 4 | data-engineering-zoomcamp | 29,486 |
| 5 | applied-ml | 27,831 |
| 6 | Prefect | 18,623 |
| 7 | Taipy | 17,890 |
| 8 | airbyte | 17,524 |
| 9 | argo | 15,457 |
| 10 | Cookbook | 14,134 |
| 11 | dagster | 12,720 |
| 12 | data-engineer-roadmap | 12,478 |
| 13 | great_expectations | 10,274 |
| 14 | xonsh | 8,661 |
| 15 | connect | 8,273 |
| 16 | Mage | 8,196 |
| 17 | risingwave | 7,534 |
| 18 | growthbook | 6,437 |
| 19 | cloudquery | 6,042 |
| 20 | feast | 5,870 |
| 21 | evidence | 4,971 |
| 22 | lakeFS | 4,592 |
| 23 | sql-translator | 4,261 |