data-engineering

Open-source projects categorized as data-engineering

Top 23 data-engineering Open-Source Projects

  1. superset

    Apache Superset is a Data Visualization and Data Exploration Platform

    Project mention: The DOJ Still Wants Google to Sell Off Chrome | news.ycombinator.com | 2025-03-08
  3. Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Project mention: The DOJ Still Wants Google to Sell Off Chrome | news.ycombinator.com | 2025-03-08
  4. Made-With-ML

    Learn how to design, develop, deploy and iterate on production-grade ML applications.

  5. data-engineering-zoomcamp

    Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.

    Project mention: Study Notes 2.2.7: Managing Schedules and Backfills with BigQuery in Kestra | dev.to | 2025-02-04

    DE Zoomcamp Resources: Data Engineering Zoomcamp

  6. applied-ml

    📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.

  7. Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: Show HN: Flow – A Dynamic Task Engine for AI Agents Without DAG | news.ycombinator.com | 2024-12-02

    - https://github.com/PrefectHQ/prefect

  8. Taipy

    Turns Data and AI algorithms into production-ready web applications in no time.

    Project mention: Start contributing to a Popular Open Source Project | dev.to | 2025-01-28

    Taipy is a Python web framework for data-driven applications that lets developers build web applications using only Python. This simplicity is especially useful for data scientists and analytics professionals. The project is gaining popularity: at the time of writing this post, it had 1.9k forks and 17.6k stars.

  10. airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Project mention: Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead | dev.to | 2025-01-27

    Whenever we discuss event streaming, Kafka inevitably enters the conversation. As the de facto standard for event streaming, Kafka is widely used as a data pipeline to move data between systems. However, Kafka is not the only tool capable of facilitating data movement. Products like Fivetran, Airbyte, and other SaaS offerings provide user-friendly tools for data ingestion, expanding the options available to engineers.

  11. argo

    Workflow Engine for Kubernetes

    Project mention: Data on Kubernetes: Part 4 - Argo Workflows: Simplify parallel jobs : Container-native workflow engine for Kubernetes 🔮 | dev.to | 2024-07-28

    Remember to meet the prerequisites, including the AWS CLI, kubectl, Terraform, and the Argo Workflows CLI.

  12. Cookbook

    The Data Engineering Cookbook

  13. dagster

    An orchestration platform for the development, production, and observation of data assets.

    Project mention: Data Orchestration Tool Analysis: Airflow, Dagster, Flyte | dev.to | 2025-01-23

    Data orchestration tools are key for managing data pipelines in modern workflows. Apache Airflow, Dagster, and Flyte are popular choices, but they serve different purposes and follow different philosophies. Choosing the right tool for your requirements is essential for scalability and efficiency. In this blog, I will compare Apache Airflow, Dagster, and Flyte, exploring their evolution, features, and unique strengths, while sharing insights from my hands-on experience with these tools in a weather data pipeline project.

  14. data-engineer-roadmap

    Roadmap to becoming a data engineer in 2021

  15. great_expectations

    Always know what to expect from your data.

  16. xonsh

    🐚 Python-powered shell. Full-featured and cross-platform.

    Project mention: Xonsh – A Python-Powered Shell | news.ycombinator.com | 2025-02-21
  17. connect

    Fancy stream processing made operationally mundane (by redpanda-data)

    Project mention: Kuvasz-streamer: open-source CDC for Postgres for low latency replication | news.ycombinator.com | 2025-01-03

    Lots of Go-based CDC work going on these days. Redpanda Connect (formerly Benthos) recently added support for Postgres CDC [1], and MySQL is coming soon too [2].

    1: https://github.com/redpanda-data/connect/pull/2917

    2: https://github.com/redpanda-data/connect/pull/3014

  18. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: Wk 3 Orchestration: MLOPs with DataTalks | dev.to | 2025-02-22

    Here, we use the free Mage AI orchestration tool.

  19. risingwave

    Stream processing and management platform.

    Project mention: Simplifying SQL function implementation with Rust procedural macro | dev.to | 2025-03-13

    Then, utilize declarative macros to generate various types of kernel functions, including functions with 1, 2, and 3 parameters, as well as the input/output combinations of T and Option<T>. Common kernels like unary, binary, ternary, unary_nullable and unary_bytes are generated, partially addressing the last two issues. (For the implementation details, see RisingWave's earlier code.) Theoretically, type exercise could also be used here. For example, introducing a trait to unify (A,), (A, B) and (A, B, C), or utilizing traits of Into and AsRef to unify T, Option<T>, and Result<T>, etc. However, you will probably encounter some type challenges posed by rustc.

  20. growthbook

    Open Source Feature Flagging and A/B Testing Platform

  21. cloudquery

    The developer-first cloud governance platform

  22. feast

    The Open Source Feature Store for AI/ML

    Project mention: 10 Must-Know Open Source Platform Engineering Tools for AI/ML Workflows | dev.to | 2025-02-06

    Unlike the other tools, Feast solves a different issue: the management of ML feature data. Feast simplifies feature management by storing and managing the code used to generate machine learning features, and it facilitates deploying these features to production. Typically, it integrates with your data sources to streamline management.

  23. evidence

    Business intelligence as code: build fast, interactive data visualizations in SQL and markdown

    Project mention: How to Build an Internal Data Application Using Google Sheets as a Data Source | dev.to | 2025-03-13

    Evidence: https://evidence.dev/

  24. lakeFS

    lakeFS - Data version control for your data lake | Git for data

  25. sql-translator

    SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

data-engineering related posts

  • The DOJ Still Wants Google to Sell Off Chrome

    4 projects | news.ycombinator.com | 8 Mar 2025
  • Hybrid in-memory and disk cache in Rust

    2 projects | news.ycombinator.com | 5 Mar 2025
  • Study Notes 2.2.7: Managing Schedules and Backfills with BigQuery in Kestra

    3 projects | dev.to | 4 Feb 2025
  • Study Note DE Zoomcamp 1.2.4 - Dockerizing the Ingestion Script

    1 project | dev.to | 4 Feb 2025
  • Show HN: Desbordante 2.3.0 is out, now supports macOS

    2 projects | news.ycombinator.com | 4 Feb 2025
  • Start contributing to a Popular Open Source Project

    2 projects | dev.to | 28 Jan 2025
  • Data Engineering Zoomcamp 2025 Cohort: Introduction - Self-Study Notes

    1 project | dev.to | 25 Jan 2025

Index

What are some of the best open-source data-engineering projects? This list will help you:

#   Project                     Stars
1   superset                    64,973
2   Airflow                     39,237
3   Made-With-ML                38,312
4   data-engineering-zoomcamp   29,486
5   applied-ml                  27,831
6   Prefect                     18,623
7   Taipy                       17,890
8   airbyte                     17,524
9   argo                        15,457
10  Cookbook                    14,134
11  dagster                     12,720
12  data-engineer-roadmap       12,478
13  great_expectations          10,274
14  xonsh                        8,661
15  connect                      8,273
16  Mage                         8,196
17  risingwave                   7,534
18  growthbook                   6,437
19  cloudquery                   6,042
20  feast                        5,870
21  evidence                     4,971
22  lakeFS                       4,592
23  sql-translator               4,261
