Query Engines: Push vs. Pull

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • duckdb

    DuckDB is an in-process SQL OLAP Database Management System

  • The DuckDB folks are migrating from pull to push and put together this interesting documentation of their reasoning! They use a vectorized model instead of a compiled one, so it's another interesting comparison point. It seems like push will simplify how they handle parallelism.

    https://github.com/duckdb/duckdb/issues/1583

  • octosql

    OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • rust

    Empowering everyone to build reliable and efficient software.

  • I like the use of generator functions for the pull-based example- the syntax closely mirrors the push-based example, which drives home the point that the difference between push and pull lies entirely in whether you compose operators via an iterator API or a callback API. You can also see this in how the two usage examples are inside-out versions of each other.

    When you look at it from the perspective of a language designer or compiler, things start to look even more similar:

    * Values that live across iterations are stored in a closure (push) or generator object (pull)

    * In the first half of an iteration, before producing a result, operators dispatch to their producers via a return address on the call stack (push) or callbacks (pull)

    * In the second half of an iteration, while building up a result, operators dispatch to their consumers via callbacks (push) or a return address on the call stack (pull)

    The difference the article mentions between "unrolling" the two approaches comes from the level of abstraction where inlining is performed. The push model produces natural-looking output when inlining the closures. The less-natural result of inlining the pull model's `next` functions is where the control flow in loops shows up. (See for example the performance benefits of adding a push-model complement to Rust's usual pull-model Iterator trait: https://github.com/rust-lang/rust/pull/42782#issuecomment-30...)

    But, you can get the same natural-looking output from the pull model if you perform inlining at the level of generators! This is a less familiar transformation than function inlining, but it's still fundamentally the same idea, of erasing API boundaries and ABI overhead- which in this case is a more straightforward formulation of the various loop optimizations that can sometimes get rid of the control-flow-in-loops produced by inlining `next` functions.

    The difference around DAGs is similar. In the push model, a DAG just means a producer calling into more than one consumer callback. In the pull model, this is tricky because you can't just yield (i.e. return from `next`) to more than one place- return addresses form a stack! (And thus the dual problem for the push model shows up for operations with multiple inputs, like the merge join mentioned in the article.)

    Overall I think a more flexible approach might be to generalize from `yield` and iterator APIs, to algebraic effects (or, lower level, delimited continuations)- these let you build both push and pull-like APIs for a single generator-like function, as well as hybrid approaches that combine aspects of both, by removing the requirement for a single privileged call stack.

  • proposal-observable

    Observables for ECMAScript

  • Rx.NET

    The Reactive Extensions for .NET

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts