Postgres and Parquet in the Data Lke

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
www.influxdata.com
featured
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com
featured
  • parquet_fdw

    Parquet foreign data wrapper for PostgreSQL

  • geoparquet

    Specification for storing geospatial vector data (point, line, polygon) in Parquet

  • > "Generating Parquet"

    It is also useful for moving data from Postgres to BigQuery! ( batch load )

    https://cloud.google.com/bigquery/docs/loading-data-cloud-st...

    Thanks for the "ogr2ogr" trick! :-)

    I hope the next blog post will be about GeoParquet and storing complex geometries in parquet format :-)

    https://github.com/opengeospatial/geoparquet

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  • I would suggest to look onto Delta Lake (https://delta.io/) - it's built on top of the Parquet, but has advantages over the plain parquet:

    - transactions - you don't get a garbage in your table if your write failed

  • cstore_fdw

    Columnar storage extension for Postgres built as a foreign data wrapper. Check out https://github.com/citusdata/citus for a modernized columnar storage implementation built as a table access method.

  • Re: performance overhead, with FDWs we have to re-munge the data into PostgreSQL's internal row-oriented TupleSlot format again. Postgres also doesn't run aggregations that can take advantage of the columnar format (e.g. CPU vectorization). Citus had some experimental code to get that working [2], but that was before FDWs supported aggregation pushdown. Nowadays it might be possible to basically have an FDW that hooks into the GROUP BY execution and runs a faster version of the aggregation that's optimized for columnar storage. We have a blog post series [3] about how we added agg pushdown support to Multicorn -- similar idea.

    There's also DuckDB which obliterates both of these options when it comes to performance. In my (again limited, not very scientific) benchmarking of on a customer's 3M row table [4] (278MB in cstore_fdw, 140MB in Parquet), I see a 10-20x (1/2s -> 0.1/0.2s) speedup on some basic aggregation queries when querying a Parquet file with DuckDB as opposed to using cstore_fdw/parquet_fdw.

    I think the dream is being able to use DuckDB from within a FDW as an OLAP query engine for PostgreSQL. duckdb_fdw [5] exists, but it basically took sqlite_fdw and connected it to DuckDB's SQLite interface, which means that a lot of operations get lost in translation and aren't pushed down to DuckDB, so it's not much better than plain parquet_fdw.

    This comment is already getting too long, but FDWs can indeed participate in partitions! There's this blog post that I keep meaning to implement where the setup is, a "coordinator" PG instance has a partitioned table, where each partition is a postgres_fdw foreign table that proxies to a "data" PG instance. The "coordinator" node doesn't store any data and only gathers execution results from the "data" nodes. In the article, the "data" nodes store plain old PG tables, but I don't think there's anything preventing them from being parquet_fdw/cstore_fdw tables instead.

    [0] https://github.com/citusdata/cstore_fdw

  • odbc2parquet

    A command line tool to query an ODBC data source and write the result into a parquet file.

  • postgres_vectorization_test

    Vectorized executor to speed up PostgreSQL

  • duckdb_fdw

    DuckDB Foreign Data Wrapper for PostgreSQL

  • SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

    SaaSHub logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • Show HN: Text-to-SQL Focus on Semantics and UI/UX

    1 project | news.ycombinator.com | 8 May 2024
  • WrenAI: Make Your Database RAG-Ready. Text-to-SQL Focus on Semantics and UI/UX

    1 project | news.ycombinator.com | 8 May 2024
  • Make your database RAG-ready

    1 project | news.ycombinator.com | 4 May 2024
  • WrenAI: Open-Source Natural Language Interface to Your Data

    1 project | news.ycombinator.com | 25 Apr 2024
  • DuckDB: Querying JSON files as if they were tables

    9 projects | news.ycombinator.com | 3 Mar 2023