It took me some time to get a good grasp of the power of SQL, and it really clicked when I learned about optimization rules. A query is a program that gets rewritten, just as an optimizing compiler would rewrite yours.
You state what you want; there are different ways to fetch, match, and massage the data; and the planner searches through that space to produce a physical plan. Ideally it uses knowledge such as table statistics to weight the alternatives worth optimizing, much as Java's JIT detects hot spots.
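To make the cost-weighting idea concrete, here is a minimal sketch (not any real planner's code; the cost formulas and table names are made up) of choosing between two join strategies from row-count statistics:

```python
# Toy cost-based plan choice: pick the cheaper join strategy using
# table statistics, the way a planner weights its alternatives.
# The cost models below are deliberately crude illustrations.
stats = {"orders": 10_000_000, "customers": 50_000}  # row counts

def nested_loop_cost(outer, inner):
    # Scan the outer table once, probe the inner side per outer row.
    return stats[outer] + stats[outer] * stats[inner]

def hash_join_cost(build, probe):
    # Build a hash table on one side, stream the other through it.
    return 2 * stats[build] + stats[probe]

candidates = {
    "nested_loop(orders, customers)": nested_loop_cost("orders", "customers"),
    "hash_join(build=customers, probe=orders)": hash_join_cost("customers", "orders"),
}
best = min(candidates, key=candidates.get)
print(best)  # hash_join(build=customers, probe=orders)
```

With realistic statistics the hash join wins by orders of magnitude, which is exactly the kind of decision the statistics exist to inform.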
I find it fascinating to peer through database code to see what is going on. Lately there have been new advances toward streaming databases, which bring a whole new design space. For example, you now optimize for the latency of each individual new row, as opposed to processing a whole batch to optimize the latency of a dataset. Batch scanning benefits from better use of your CPU caches.
And maybe you could have a hybrid system which reads history from a log and aggregates in a batched manner, and then switches to another execution plan when it reaches the end of the log.
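A minimal sketch of that hybrid idea (purely illustrative; the function names and batch size are made up), where the backlog is aggregated in batches and the system then switches to per-record handling once it catches up:

```python
# Hybrid execution: replay history from a log in batches (amortized,
# cache-friendly), then switch to per-record processing for live data.
def process(log_history, live_stream, batch_size=1000):
    counts = {}
    # Phase 1: batch over the historical backlog.
    for i in range(0, len(log_history), batch_size):
        for key in log_history[i:i + batch_size]:
            counts[key] = counts.get(key, 0) + 1
    # Phase 2: end of the log reached; handle live records one at a time.
    for key in live_stream:
        counts[key] = counts.get(key, 0) + 1
    return counts

print(process(["a", "b", "a"], ["b"]))  # {'a': 2, 'b': 2}
```

A real system would also have to hand off state between the two phases consistently, which is where most of the difficulty lives.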
If you want to have a peek at that, here is Flink's set of rules [1], both generic and stream-specific. The names can be cryptic but usually give a good sense of what is going on. For example, PushFilterIntoTableSourceScanRule applies the WHERE clause as early as possible, to save CPU and network bandwidth further down the plan. PushPartitionIntoTableSourceScanRule tries to make the fan-out/shuffle happen as early as possible, so that parallelism can be exploited.
[1] https://github.com/apache/flink/blob/5f8fb304fb5d68cdb0b3e3c...
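To show the shape of a rule like PushFilterIntoTableSourceScanRule, here is a toy rewrite on a made-up plan tree (this is not Flink or Calcite code; the node classes and query are invented for illustration):

```python
# Toy filter pushdown: rewrite Filter(Scan) into a Scan that evaluates
# the predicate itself, so rows are discarded at the source.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scan:
    table: str
    predicate: Optional[str] = None  # predicate pushed into the scan, if any

@dataclass
class Filter:
    predicate: str
    input: object

@dataclass
class Project:
    columns: list
    input: object

def push_filter_into_scan(plan):
    """Recursively push a Filter sitting directly on a Scan into the Scan."""
    if isinstance(plan, Filter) and isinstance(plan.input, Scan):
        return Scan(plan.input.table, predicate=plan.predicate)
    if isinstance(plan, Filter):
        return Filter(plan.predicate, push_filter_into_scan(plan.input))
    if isinstance(plan, Project):
        return Project(plan.columns, push_filter_into_scan(plan.input))
    return plan

# SELECT name FROM users WHERE age > 30
before = Project(["name"], Filter("age > 30", Scan("users")))
after = push_filter_into_scan(before)
print(after)  # Project(columns=['name'], input=Scan(table='users', predicate='age > 30'))
```

Real planners apply dozens of such rules to a fixed point, with a cost model arbitrating between the resulting plans.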
This is really neat, but I'm curious how much of this is novel?
I don't know much about relational algebra and query planning/optimization, but many of these seemed like "no-brainers" (eliminating completely duplicated query conditions, etc.).
I took one of the queries from the paper (the first one) and used Apache Calcite to run the same query on the same schema:
https://github.com/GavinRay97/athena-paper-calcite-sandbox/b...
Comparing its original plan against its best plan, the same sorts of effects seem to be achieved.
Here is the original plan for Query 1; you can see the aggregate expression is duplicated (Ctrl+F for "STORE_SALES"):
LogicalSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC], fetch=[100])