Airflow's Problem

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Taskflow

    A General-purpose Parallel and Heterogeneous Task Programming System

  • typhoon-orchestrator

    Create elegant data pipelines and deploy to AWS Lambda or Airflow

  • I have my own opinion on Airflow's pain points and created Typhoon Orchestrator (https://github.com/typhoon-data-org/typhoon-orchestrator) to solve them. It doesn't have many stars yet, but I've used it to create some pipelines for medium-sized companies in a few days, and they've been running for over a year without issues.

    In particular, I transpile to Airflow code (it can also deploy to Lambda) because I think Airflow is still the most robust and well-supported "runtime". I just don't think the developer experience is that good.

  • pachyderm

    Data-Centric Pipelines and Data Versioning

  • I was at Airbnb when we open-sourced Airflow. It was a great solution to the problems we had at the time, and it's amazing how many more use cases people have found for it since then. Back then it was pretty focused on solving our problem of orchestrating a largely static DAG of SQL jobs. It could do other stuff even then, but that was mostly what we were using it for. Airflow has become a victim of its own success as it has expanded to meet every problem that could ever be considered a data workflow. The flaws and horror stories in the post and comments here definitely resonate with me.

    Around the time Airflow was open-sourced, I started working on a data-centric approach to workflow management called Pachyderm[0]. By data-centric I mean that it's focused on the data itself: its storage, versioning, orchestration and lineage. This leads to a system that feels radically different from a job-focused system like Airflow. In a data-centric system your spaghetti nest of DAGs is greatly simplified, because the data itself is used to describe most of the complexity. The benefit is that data is a lot simpler to reason about. It's not a living thing that needs to run in a certain way; it just exists, and because it's versioned you have strong guarantees about how it can change.

    [0] https://github.com/pachyderm/pachyderm

  • kestra

    Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

  • But I totally agree that a large static DAG is not appropriate in the real data world, with data mesh and domain responsibility.

    [0] https://github.com/kestra-io/kestra

  • orchest

    Build data pipelines, the easy way 🛠️

  • Argo is pretty amazing if you want to take advantage of the work Kubernetes has done to scale resources efficiently across a cluster of compute nodes.

    If you’re looking for something that’s a bit more high level and friendly to expose directly to your data team (data scientists/data engineers/data analysts) you can check out https://github.com/orchest/orchest

    You can think of it as a browser UI/workbench for Argo scheduled pipelines. Disclaimer: author of the project

  • flyte

    Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

  • Some of these were the core problems that we wanted to address as part of https://flyte.org. We started with a team-first, multi-tenant approach at the core. For example, each team can have separate IAM roles, secrets are restricted to teams, tasks and workflows are shareable across teams without making libraries, and it is possible to trigger workflows across teams.
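
For readers unfamiliar with the term, the "largely static DAG of SQL jobs" mentioned in the Pachyderm comment above can be pictured with a small sketch. This is not Airflow's API or any listed project's code; it only uses Python's standard `graphlib` module, and the job names are hypothetical, chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical job names (not from the original post).
# Each key maps a job to the set of jobs it depends on.
SQL_JOBS = {
    "load_events": set(),
    "load_users": set(),
    "sessions": {"load_events"},
    "daily_report": {"sessions", "load_users"},
}


def run_order(jobs):
    """Return one valid execution order for a static DAG of jobs."""
    # static_order() yields every job after all of its dependencies
    # and raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(jobs).static_order())


order = run_order(SQL_JOBS)
```

In a job-centric scheduler the graph above is fixed ahead of time and the jobs are what gets orchestrated; the data-centric approach described in the comment instead lets versioned data drive what needs to run.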

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
