ETL Pipelines with Airflow: The Good, the Bad and the Ugly

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • daggy

  • Thanks for the feedback. I'll take a look at how Luigi models task state. Right now each TaskExecutor type is responsible for running and reporting on tasks (e.g. the Slurm executor submits jobs and monitors them for completion). I was considering adding a companion "verify" stage for every vertex, which would be a command that runs and verifies the output. That might be a way to do what I think you're describing above without having to build a variety of expected-output checks into the daggy core. I'll check what Luigi is doing, though.

    > resuming a partially failed build

    Daggy does this! Right now it will continue running the DAG until every path is complete, or until all vertices in a processing state (queued, running, retry, error) end up in the error state, at which point the whole DAG goes to an error state.

    It's possible to explicitly set task/vertex states (e.g. mark a vertex complete if the step was finished manually), then change the DAG state to QUEUED, at which point the DAG resumes execution from where it left off. [1] is a unit test that walks through that functionality; a rough sketch of the pattern follows below.

    [1] https://gitlab.com/iroddis/daggy/-/blob/master/tests/unit_se...
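
    As a rough illustration of the resume pattern described above, here is a minimal Python sketch. The class and method names are hypothetical, not daggy's actual API; see the linked unit test for the real interface.

      from enum import Enum, auto

      class State(Enum):
          # Hypothetical vertex/DAG states matching the ones described above.
          QUEUED = auto()
          RUNNING = auto()
          RETRY = auto()
          ERROR = auto()
          COMPLETE = auto()

      class Dag:
          # Toy stand-in for a daggy-style DAG; not the real daggy API.
          def __init__(self, vertices):
              self.vertices = vertices   # vertex name -> State
              self.state = State.ERROR   # a partially failed build

          def set_vertex_state(self, name, state):
              # Explicitly override a vertex state, e.g. mark a step
              # COMPLETE after finishing it by hand.
              self.vertices[name] = state

          def resume(self):
              # Re-queue the DAG; failed/retrying vertices go back to
              # QUEUED and execution picks up where it left off.
              self.state = State.QUEUED
              for name, state in self.vertices.items():
                  if state in (State.ERROR, State.RETRY):
                      self.vertices[name] = State.QUEUED

      # A three-step pipeline in which "transform" failed and was fixed by hand.
      dag = Dag({"extract": State.COMPLETE,
                 "transform": State.ERROR,
                 "load": State.QUEUED})
      dag.set_vertex_state("transform", State.COMPLETE)
      dag.resume()   # DAG is QUEUED again and continues with "load"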

  • materialize

    The data warehouse for operational workloads. (by MaterializeInc)

  • Are you perhaps talking about something like https://materialize.com/ ? (btw, dbt now has some materialize compatibility)

    Maybe Pravega and Beam working together?

  • NVTabular

    NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep-learning-based recommender systems.

  • If you have GPUs, NVTabular outperforms most of the frameworks out there: https://github.com/NVIDIA/NVTabular
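
    As a rough illustration (the column names and file paths here are made up, and the exact API can vary between NVTabular versions), a minimal workflow looks something like this:

      import glob
      import nvtabular as nvt
      from nvtabular import ops

      # Per-column-group preprocessing graphs are built with the >> operator.
      cat_features = ["user_id", "item_id"] >> ops.Categorify()
      cont_features = ["price", "age"] >> ops.FillMissing() >> ops.Normalize()

      workflow = nvt.Workflow(cat_features + cont_features)

      # nvt.Dataset streams data in chunks, so it scales past GPU memory.
      train = nvt.Dataset(glob.glob("train/*.parquet"), engine="parquet")
      workflow.fit(train)
      workflow.transform(train).to_parquet(output_path="train_processed/")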

  • Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

  • If you prefer Scala, then you can try Scio: https://github.com/spotify/scio.

  • dbt-expectations

    Port(ish) of Great Expectations to dbt test macros

  • [dbt Labs employee here]

    Check out the dbt-expectations package [1]. It's a port of the Great Expectations checks to dbt as tests. The advantage of this is you don't need another tool for these pretty standard tests, and they can be easily incorporated into dbt workflows; a rough example follows below.

    [1] https://github.com/calogica/dbt-expectations
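
    For example (the model and column names here are placeholders), after adding the package to packages.yml, dbt-expectations tests are declared in a model's schema.yml like any other dbt test:

      version: 2

      models:
        - name: orders
          tests:
            - dbt_expectations.expect_table_row_count_to_be_between:
                min_value: 1
          columns:
            - name: order_total
              tests:
                - dbt_expectations.expect_column_values_to_be_between:
                    min_value: 0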

  • cuetils

    CLI and library for diff, patch, and ETL operations on CUE, JSON, and YAML

  • I got inspired and started this over the weekend to demonstrate what is possible.

    https://github.com/hofstadter-io/cuetils

