Looks like somebody requested it after reading your TIL. https://github.com/pola-rs/polars/issues/12493#issuecomment-...
It will be in the next release. (later today?)
Right, there's all sorts of metadata, and often stats, included in any Parquet file: https://github.com/apache/parquet-format#file-format
The offsets of that metadata are well-defined (it lives in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.
We can run a DuckDB instance (EC2/S3) closer to the data, so that sorta helps too.
What I'm really excited about is using DuckDB in a way similar to map-reduce. What if there were a way to take a SQL query's logical plan and turn it into a physical plan that uses compute resources from a pool of serverless DuckDB instances? Start at the leaves of the graph (physical plan), pulling data from the source (Parquet), and return each node's completed work up the branches until the result is complete and ready to use.
I've seen a few examples of this already, but nothing that I would consider production ready. I have a hunch that someone is going to drop such a project on us shortly, and it's going to change a lot of things we have become used to in the data world.
https://github.com/BauplanLabs/quack-reduce