Postgres and Parquet in the Data Lke

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Build time-series-based applications quickly and at scale.
  • SonarLint - Clean code begins in your IDE with SonarLint
  • Scout APM - Truly a developer’s best friend
  • Zigi - The context switching struggle is real
  • parquet_fdw

    Parquet foreign data wrapper for PostgreSQL

  • geoparquet

    Specification for storing geospatial vector data (point, line, polygon) in Parquet

    > "Generating Parquet"

    It is also useful for moving data from Postgres to BigQuery! ( batch load )

    https://cloud.google.com/bigquery/docs/loading-data-cloud-st...

    Thanks for the "ogr2ogr" trick! :-)

    I hope the next blog post will be about GeoParquet and storing complex geometries in parquet format :-)

    https://github.com/opengeospatial/geoparquet

  • InfluxDB

    Build time-series-based applications quickly and at scale.. InfluxDB is the Time Series Data Platform where developers build real-time applications for analytics, IoT and cloud-native services in less time with less code.

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    I would suggest to look onto Delta Lake (https://delta.io/) - it's built on top of the Parquet, but has advantages over the plain parquet:

    - transactions - you don't get a garbage in your table if your write failed

  • cstore_fdw

    Columnar storage extension for Postgres built as a foreign data wrapper. Check out https://github.com/citusdata/citus for a modernized columnar storage implementation built as a table access method.

    Re: performance overhead, with FDWs we have to re-munge the data into PostgreSQL's internal row-oriented TupleSlot format again. Postgres also doesn't run aggregations that can take advantage of the columnar format (e.g. CPU vectorization). Citus had some experimental code to get that working [2], but that was before FDWs supported aggregation pushdown. Nowadays it might be possible to basically have an FDW that hooks into the GROUP BY execution and runs a faster version of the aggregation that's optimized for columnar storage. We have a blog post series [3] about how we added agg pushdown support to Multicorn -- similar idea.

    There's also DuckDB which obliterates both of these options when it comes to performance. In my (again limited, not very scientific) benchmarking of on a customer's 3M row table [4] (278MB in cstore_fdw, 140MB in Parquet), I see a 10-20x (1/2s -> 0.1/0.2s) speedup on some basic aggregation queries when querying a Parquet file with DuckDB as opposed to using cstore_fdw/parquet_fdw.

    I think the dream is being able to use DuckDB from within a FDW as an OLAP query engine for PostgreSQL. duckdb_fdw [5] exists, but it basically took sqlite_fdw and connected it to DuckDB's SQLite interface, which means that a lot of operations get lost in translation and aren't pushed down to DuckDB, so it's not much better than plain parquet_fdw.

    This comment is already getting too long, but FDWs can indeed participate in partitions! There's this blog post that I keep meaning to implement where the setup is, a "coordinator" PG instance has a partitioned table, where each partition is a postgres_fdw foreign table that proxies to a "data" PG instance. The "coordinator" node doesn't store any data and only gathers execution results from the "data" nodes. In the article, the "data" nodes store plain old PG tables, but I don't think there's anything preventing them from being parquet_fdw/cstore_fdw tables instead.

    [0] https://github.com/citusdata/cstore_fdw

  • odbc2parquet

    A command line tool to query an ODBC data source and write the result into a parquet file.

  • postgres_vectorization_test

    Vectorized executor to speed up PostgreSQL

  • duckdb_fdw

    DuckDB Foreign Data Wrapper for PostgreSQL

  • SonarLint

    Clean code begins in your IDE with SonarLint. Up your coding game and discover issues early. SonarLint is a free plugin that helps you find & fix bugs and security issues from the moment you start writing code. Install from your favorite IDE marketplace today.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts