Lessons Learned from Running Apache Airflow at Scale

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • windmill

    Open-source developer platform to turn scripts into workflows and UIs. Claims to be the fastest workflow engine (5x faster than Airflow in its own benchmarks). Open-source alternative to Airplane and Retool.

  • Shameless plug: I am building such a system, where the modules are code (TypeScript on Deno or Python) but the orchestration is no-code (flows). It is fully open source: https://github.com/windmill-labs/windmill. A minimal sketch of such a script module follows below.
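
    The sketch assumes Windmill's convention that a Python script exposes a main function whose typed parameters drive the auto-generated input form; the parameter names are made up:

```python
# Hypothetical Windmill-style script module: one step of a flow.
# Assumed convention: the entrypoint is `main`, and its typed parameters
# become the auto-generated input form for this step.
def main(user_id: int, dry_run: bool = True) -> dict:
    """Summarize a user's activity as one step of a flow."""
    # Real work (API calls, DB queries, ...) would go here.
    summary = {"user_id": user_id, "events": 0 if dry_run else 42}
    # The return value is passed as JSON to the next step of the flow.
    return summary
```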

  • orchest

    Build data pipelines, the easy way 🛠️

  • We kept hearing this from our users. We've just released our k8s operator-based deployment of Orchest, which should give you a good experience running an orchestration tool on k8s without much trouble.

    https://github.com/orchest/orchest

  • cronitor-airflow

    Cronitor integration for Airflow

  • Is anybody out there doing anything interesting with Airflow monitoring?

    At my startup, Cronitor, we have an Airflow SDK* that makes it pretty easy to provision monitoring for each DAG, but essentially we only monitor that a DAG started on time and the total time it took (a plain-Airflow sketch of the same idea follows below).

    * https://github.com/cronitorio/cronitor-airflow
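
    This is not the Cronitor SDK itself, just DAG-level Airflow callbacks that report whether a run succeeded and how long it took; report_to_monitoring is a hypothetical stand-in for whatever telemetry client you use, and EmptyOperator requires Airflow 2.3+:

```python
from datetime import datetime, timezone

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow 2.3+


def report_to_monitoring(event: str, **fields):
    """Hypothetical stand-in for a real telemetry/monitoring client."""
    print(event, fields)


def on_success(context):
    run = context["dag_run"]
    duration_s = (datetime.now(timezone.utc) - run.start_date).total_seconds()
    report_to_monitoring("dag_succeeded", dag_id=run.dag_id, duration_s=duration_s)


def on_failure(context):
    report_to_monitoring("dag_failed", dag_id=context["dag_run"].dag_id)


with DAG(
    dag_id="monitored_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    on_success_callback=on_success,
    on_failure_callback=on_failure,
) as dag:
    EmptyOperator(task_id="noop")
```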

  • states-language-cadence

    States Language on Cadence

  • https://github.com/checkr/states-language-cadence allows you to define workflows in the Amazon States Language and run them on Cadence; a sketch of such a definition follows below.
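
    The state names and activity resources below are made up, and this shows only the Amazon States Language shape (here as a Python dict), not how states-language-cadence maps each Resource onto a Cadence activity:

```python
# A rough, hypothetical States Language definition: two sequential Task states.
workflow_definition = {
    "Comment": "Extract, then load",
    "StartAt": "Extract",
    "States": {
        "Extract": {"Type": "Task", "Resource": "extract-activity", "Next": "Load"},
        "Load": {"Type": "Task", "Resource": "load-activity", "End": True},
    },
}
```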

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

  • When it comes to scale and DS work, I'd use the Ploomber open-source framework (https://github.com/ploomber/ploomber). It allows an easy transition between development and production and builds the DAG incrementally, so you avoid expensive compute time and costs. It's easier to maintain and integrates seamlessly with Airflow, generating the Airflow DAGs for you (a rough sketch follows below).
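
    The sketch assumes Ploomber's Python API (DAG, PythonCallable, File); the task functions and file names are illustrative, and the point is that dag.build() only re-runs tasks whose code or upstream products changed:

```python
from pathlib import Path

from ploomber import DAG
from ploomber.products import File
from ploomber.tasks import PythonCallable


def raw(product):
    # Produce the raw dataset.
    Path(str(product)).write_text("id,value\n1,42\n")


def clean(upstream, product):
    # Consume the upstream "raw" product and write a cleaned copy.
    Path(str(product)).write_text(Path(str(upstream["raw"])).read_text().strip())


dag = DAG()
t_raw = PythonCallable(raw, File("raw.csv"), dag, name="raw")
t_clean = PythonCallable(clean, File("clean.csv"), dag, name="clean")
t_raw >> t_clean  # declare the dependency
dag.build()       # incremental: up-to-date tasks are skipped
```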

  • dbt-core

    dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

  • dbt has just opened a serious conversation about supporting Python models; I'm sure they'd value your viewpoint: https://github.com/dbt-labs/dbt-core/discussions/5261 (a sketch of the proposed shape follows below).
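
    The sketch follows the def model(dbt, session) interface discussed in that thread (and later shipped by dbt); stg_orders is a hypothetical upstream model, and the exact DataFrame API depends on the adapter:

```python
def model(dbt, session):
    # Reference an upstream model; dbt resolves it to a DataFrame
    # (pandas, Snowpark, or PySpark depending on the adapter).
    orders = dbt.ref("stg_orders")
    # Return a DataFrame; dbt materializes it as a table.
    return orders.groupby("customer_id", as_index=False).agg({"amount": "sum"})
```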

  • luigi

    Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

  • What are you trying to do? A distributed scheduler with a single instance? No database? Are you sure you don't just mean "a scheduler" à la Luigi (https://github.com/spotify/luigi)? A minimal Luigi pipeline is sketched below.
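
    In Luigi, dependencies come from requires() and a task counts as done once its output() target exists; the file names here are illustrative:

```python
import datetime

import luigi


class Fetch(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"raw-{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("id,value\n1,42\n")


class Aggregate(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return Fetch(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"agg-{self.date}.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())  # placeholder "aggregation"


if __name__ == "__main__":
    # local_scheduler=True: no central scheduler daemon needed.
    luigi.build([Aggregate(date=datetime.date.today())], local_scheduler=True)
```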

  • magniv-core

    Magniv Core - A Python-decorator-based job orchestration platform. Avoid responsibility handoffs by abstracting away infra and DevOps.

  • We at magniv.io are building an alternative.

    Our core is open source https://github.com/MagnivOrg/magniv-core

    We can set you up with our hosted version if you would like to poke around! A sketch of the decorator style follows below.
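
    The sketch assumes the task decorator from magniv.core with a cron-style schedule argument, as shown in the project's README; the job body is illustrative:

```python
from magniv.core import task


@task(schedule="@daily", description="Refresh the daily metrics table")
def refresh_metrics():
    # Real work (queries, model refreshes, ...) would go here.
    print("recomputing metrics...")
```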

  • proposals

    Temporal proposals (by temporalio)

  • You're probably thinking of Temporal (https://temporal.io/), which is a fork of the Cadence project originally developed at Uber.

  • toil

    A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.

  • stepwise

    Clojure AWS Step Functions library

  • I feel you. That's why we wrote a little library on top of AWS Step Functions (SFN) so that we can program SFN with Clojure instead of YAML: https://github.com/Motiva-AI/stepwise. Application code sits alongside the SFN definition, and SFN Tasks are automatically integrated as polling Activities from Clojure code.

    Thoughtworks made a case for this distinction in https://martinfowler.com/articles/cant-buy-integration.html#...

  • direktiv

    Serverless Container Orchestration

  • So, being completely transparent, we're the creators of Direktiv (https://github.com/direktiv/direktiv). We're genuinely curious to have users who have previously used Airflow and other DAG-based tools (Argo Workflows was mentioned in here) try Direktiv and give us feedback.

    - Direktiv runs containers as part of workflows, pulled from any compliant container registry, and passes JSON-structured data between workflow states.

