Summing columns in remote Parquet files using DuckDB

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • polars

    Dataframes powered by a multithreaded, vectorized query engine, written in Rust

  • Looks like somebody requested it after reading your TIL. https://github.com/pola-rs/polars/issues/12493#issuecomment-...

    It will be in the next release. (later today?)

  • parquet-format

    Apache Parquet

  • Right, there's all sorts of metadata and often stats included in any parquet file: https://github.com/apache/parquet-format#file-format

    The offsets of that metadata are well-defined (i.e. in the footer), so for S3 / blob storage, as long as you can efficiently request a range of bytes, you can pull the metadata without having to read all the data.
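
    A minimal sketch of the range-request idea above, assuming a plaintext (unencrypted) Parquet footer: the file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1, so the last 8 bytes are enough to locate the metadata. The function names here are hypothetical helpers, not part of any library, and the HTTP call itself is left out.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # trailing magic of a plaintext-footer Parquet file


def tail_range_header(n: int = 8) -> dict:
    """HTTP header asking the server for only the last n bytes of an object."""
    return {"Range": f"bytes=-{n}"}


def footer_byte_range(file_size: int, tail: bytes) -> tuple[int, int]:
    """Given the file size and its last 8 (or more) bytes, return the
    half-open byte range [start, end) that holds the footer metadata."""
    if len(tail) < 8 or tail[-4:] != PARQUET_MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")
    # 4-byte little-endian length of the serialized file metadata
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    end = file_size - 8          # metadata stops just before length + magic
    start = end - footer_len
    if start < 4:                # must leave room for the leading magic bytes
        raise ValueError("footer length is larger than the file")
    return start, end
```

    A client would first fetch the tail with tail_range_header(), then issue a second range request for the bytes returned by footer_byte_range() and decode them as Thrift metadata. Neither step touches the column data.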

  • quack-reduce

    A playground for running duckdb as a stateless query engine over a data lake

  • We can run a DuckDB instance (EC2/S3) closer to the data so that sorta helps too.

    What I'm really excited about is using DuckDB in a way similar to map-reduce. What if there was a way to take some SQL's logical plan and turn it into a physical plan that uses compute resources from a pool of serverless DuckDB instances? Starting at the leaves of the graph (the physical plan), workers would pull data from the source (parquet) and return their completed work up the branches, until the query is complete and ready to be used as the result.

    I've seen a few examples of this already, but nothing that I would consider production ready. I have a hunch that someone is going to drop such a project on us shortly, and it's going to change a lot of things we have become used to in the data world.

    https://github.com/BauplanLabs/quack-reduce
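
    The leaf-to-root flow described in this comment can be sketched for the simplest case, a sum. This is only an illustration under stated assumptions, not how quack-reduce or DuckDB works: leaf_sum is a hypothetical stand-in for a serverless worker that would run a sum over one Parquet file, and a local thread pool stands in for the pool of instances.

```python
from concurrent.futures import ThreadPoolExecutor


def leaf_sum(rows):
    """Leaf task: hypothetical stand-in for a worker that would run
    something like SELECT sum(x) over a single Parquet file."""
    return sum(rows)


def distributed_sum(partitions, max_workers=4):
    """Map each partition to a partial sum in parallel, then reduce the
    partial sums into one result at the root of the plan."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(leaf_sum, partitions))
    return sum(partials)
```

    Sums combine cleanly because a sum of partial sums equals the total. Aggregates such as averages or distinct counts need extra state carried up the tree, for example a (sum, count) pair for an average.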


Related posts