How we made data aggregation on PostgreSQL better and faster

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
www.influxdata.com
featured
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com
featured
  • ClickHouse

    ClickHouse® is a free analytics DBMS for big data

  • ClickHouse [1] is a "fast open source (almost)SQL analytics database" you are looking for :)

    [1] https://clickhouse.com/

  • materialize

    The data warehouse for operational workloads. (by MaterializeInc)

  • https://materialize.com/ is billed to be that. That behavior is not trivial to implement.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • pg_ivm

    IVM (Incremental View Maintenance) implementation as a PostgreSQL extension

  • Interesting. I check back on the wiki page every couple of years to see what progress is being made.

    The basic idea seems to be to track the primary keys of the base tables in the incremental view and then use triggers to update those rows when source rows are updated.

    The meat of the project is over here for anyone that’s interested - in particular this section about the limitations is pretty interesting (and expected).

    https://github.com/sraoss/pg_ivm/#supported-view-definitions...

  • TimescaleDB

    An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.

  • (NB - blog author/Timescale employee)

    One thing we're improving as we move forward in documentation and other areas is explaining why doing joins (and things like window functions) is difficult in continuous aggregates and not the current focus. Honestly, it's part of the reason most databases haven't tackled this problem before.

    Once you add in joins or things that might refer to data outside of the refresh window (LAG values for example), things get really complicated. For instance, if you join to a dimension table and a piece of metadata changes, does that change now need to be updated and reflected back in all of this historical aggregate data that's outside of the automatic refresh policy? Same with a window function - if data within a window hasn't changed but data that *might* be hit because of the window function reference does change, continuous aggregates would have to know about that for each query and track those changes too.

    I'm not saying it's impossible or that it won't be solved someday, but the functionality with continuous aggregates that keeps the aggregate data updated automatically (without losing any history) *and* being able to perform fast joins on the finalized data is a very useful step that's not available anywhere else within the Postgres ecosystem.

    RE: CAGG on top of a CAGG - you're certainly not the only person to request this[1] () and we understand that. Part of this is because of what I discussed above (tracking changes across multiple tables), although having finalized data might make this more possible in the future.

    That said (!!!), the cool thing is that we already *have* begun to solve this problem with hyperfunction aggregates and 2-step aggregation, something I showed in the blog post. So, if your dataset can benefit from one of the hyperfunction aggregates that we currently provide, there are lots of cool things you can do with it, including rollups into bigger buckets without creating a second continuous aggregate! If you haven't checked them out, please do! [2][3]

    [1]: https://github.com/timescale/timescaledb/issues/1400

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • GlareDB: An analytics DBMS for distributed data

    1 project | news.ycombinator.com | 20 Sep 2023
  • GlareDB – Your Data Pipeline, Simplified

    1 project | news.ycombinator.com | 19 Sep 2023
  • Querying Postgres Tables Directly from DuckDB

    6 projects | news.ycombinator.com | 30 Sep 2022
  • TimescaleDB 2.7 vs. PostgreSQL 14

    4 projects | news.ycombinator.com | 22 Sep 2022
  • I'm building a real-time monitoring system for metrics of servers in Rust !

    2 projects | /r/rust | 21 Jun 2021