Loading a trillion rows of weather data into TimescaleDB

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • arco-era5

    Recipes for reproducing Analysis-Ready & Cloud Optimized (ARCO) ERA5 datasets.

  • Why?

    Most weather and climate datasets - including ERA5 - are highly structured on regular latitude-longitude grids. Even if you were only doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to destroy that structure unless your sole and exclusive goal is to extract point timeseries. Even then, you'd probably want to decimate the data dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!

    The vast majority of research and operational applications of datasets like ERA5 are better served by cloud-optimized replicas of the original dataset, such as ARCO-ERA5, published through the Google Public Datasets program [1]. These versions preserve the original structure and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost every case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this has been more than performant enough for the use cases one might care about (a minimal access sketch follows this item).

    [1]: https://cloud.google.com/storage/docs/public-datasets/era5
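
    A minimal access sketch, assuming the xarray, zarr, and gcsfs packages; the store path and variable name below are placeholders, so check the ARCO-ERA5 repository for the current bucket layout:

    ```python
    # Sketch: pull a point timeseries from a cloud-optimized ERA5 Zarr store
    # without downloading the full archive. Store path and variable name are
    # assumptions; consult the ARCO-ERA5 docs for the exact ones.
    import xarray as xr

    store = "gs://gcp-public-data-arco-era5/ar/<store-name>.zarr"  # placeholder path
    ds = xr.open_zarr(store, storage_options={"token": "anon"})

    # Nearest-grid-point 2 m temperature for one location and one year;
    # only the chunks overlapping this selection are fetched from cloud storage.
    point = (
        ds["2m_temperature"]
        .sel(latitude=52.5, longitude=13.4, method="nearest")
        .sel(time=slice("2020-01-01", "2020-12-31"))
    )
    print(point.load())
    ```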

  • proton

    A streaming SQL engine, a fast and lightweight alternative to ksqlDB and Apache Flink, 🚀 powered by ClickHouse. (by timeplus-io)

  • What's the process for adding support for other databases to your tool qStudio?

    I'm thinking perhaps you could add support for Timeplus [1]? Timeplus is a streaming-first database built on ClickHouse. The core DB engine Timeplus Proton is open source [2].

    It seems that qStudio is open source [3] and written in Java, so it would need a JDBC driver to add support for a new RDBMS? If so, Timeplus Proton has an open-source JDBC driver [4] based on ClickHouse's driver, with modifications for streaming use cases.

    1: https://www.timeplus.com/

    2: https://github.com/timeplus-io/proton

    3: https://github.com/timeseries/qstudio

    4: https://github.com/timeplus-io/proton-java-driver

  • qstudio

    qStudio - Free SQL Analysis Tool

  • proton-java-driver

    JDBC driver for Timeplus Proton

  • ClickBench

    ClickBench: a Benchmark For Analytical Databases

  • TimescaleDB primarily serves operational use cases: developers building products on top of live data, where you are regularly streaming in fresh data and often know a priori what many queries look like, because they power your live APIs, dashboards, and product experience.

    That's different from a data warehouse or many traditional "OLAP" use cases, where you might dump a big dataset statically and then people occasionally run ad-hoc queries against it. This is the big weather dataset file sitting on your desktop that you occasionally query over the holidays.

    So it's less about "can you store weather data" and more about what the use case looks like. How are the queries shaped? Are you saving a single dataset for ad-hoc queries across the entire dataset, or continuously streaming in new data and aging out or de-prioritizing old data?

    In most of the products we serve, customers are interested either in recent data in a very granular format ("shallow and wide") or in longer historical queries along a well-defined axis ("deep and narrow").

    For example, this is where the benefits of TimescaleDB's segmented columnar compression emerge. It optimizes for the queries that are most common in your application, e.g., an IoT application that groups or filters by deviceID, crypto/fintech analysis keyed on the ticker symbol, product analytics keyed on tenantID, etc. (a minimal compression-setup sketch follows this comment).

    If you look at ClickBench, what most of the queries effectively say is: scan ALL the data in your database and GROUP BY one of the 100 columns in the web analytics logs.

    - https://github.com/ClickHouse/ClickBench/blob/main/clickhous...

    There are almost no time predicates in the benchmark that ClickHouse created, but perhaps that is not surprising given it was designed for ad-hoc weblog analytics at Yandex.

    So yes, Timescale serves many products today that use weather data, but it has made different choices than ClickHouse (or things like DuckDB, pg_analytics, etc.) to serve those more operational use cases.
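
    A minimal sketch of the kind of setup described above, using TimescaleDB's documented hypertable and compression options via psycopg2; the connection string, table, and column names are hypothetical:

    ```python
    # Sketch: a hypertable whose compression is segmented by device_id, so
    # "deep and narrow" per-device history queries read contiguous data.
    # Connection string, table, and columns are hypothetical examples.
    import psycopg2

    conn = psycopg2.connect("postgresql://postgres:password@localhost:5432/weather")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS conditions (
            time        timestamptz NOT NULL,
            device_id   text        NOT NULL,
            temperature double precision
        );
    """)
    cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")

    # Columnar compression segmented by device_id and ordered by time.
    cur.execute("""
        ALTER TABLE conditions SET (
            timescaledb.compress,
            timescaledb.compress_segmentby = 'device_id',
            timescaledb.compress_orderby   = 'time DESC'
        );
    """)

    # A typical operational query: one device, bounded by a time predicate.
    cur.execute("""
        SELECT time_bucket('1 hour', time) AS bucket, avg(temperature)
        FROM conditions
        WHERE device_id = %s AND time > now() - interval '7 days'
        GROUP BY bucket ORDER BY bucket;
    """, ("station-42",))
    print(cur.fetchall())
    ```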

  • open-data

    Open-Meteo on AWS Open Data (by open-meteo)

  • Creator of Open-Meteo here. There is a small tutorial for setting up ERA5 locally: https://github.com/open-meteo/open-data/tree/main/tutorial_d...

    Under the hood, Open-Meteo uses a custom file format with time-series chunking and specialised compression for low-frequency weather data. General-purpose time-series databases do not come close to this setup; the sketch after this comment illustrates the chunking idea in general terms.
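
    This is not Open-Meteo's actual format, but a small illustration of the time-series-chunking idea using xarray, Zarr, and dask: store each chunk as the full time axis for a small spatial tile, so a per-location read touches only a few chunks. File names and the variable name are hypothetical.

    ```python
    # Illustrative sketch only: rechunk a gridded variable so chunks span the
    # whole time axis for an 8x8 spatial tile, then read one location cheaply.
    import xarray as xr

    ds = xr.open_dataset("era5_subset.nc")  # hypothetical local subset
    t2m = ds["2m_temperature"].chunk({"time": -1, "latitude": 8, "longitude": 8})
    t2m.to_dataset(name="2m_temperature").to_zarr("era5_timeseries_layout.zarr", mode="w")

    # A single-location timeseries now streams from one (or a few) chunks
    # instead of touching one chunk per time step.
    point = xr.open_zarr("era5_timeseries_layout.zarr")["2m_temperature"].sel(
        latitude=52.5, longitude=13.4, method="nearest"
    )
    print(point.load())
    ```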

  • timescaledb-insert-benchmarks

    Benchmarking inserting a ~trillion rows of weather data into TimescaleDB

  • The full dataset is huge (~9 petabytes and growing), of which I'm using just ~8 terabytes. Still quite big to upload.

    The data is freely available from the [Climate Change Service](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...), which has a nice API, but download speeds can be a bit slow.

    [NCAR's Research Data Archive](https://rda.ucar.edu/datasets/ds633-0/) provides some of the data (as pre-generated NetCDF files) but at higher download speeds.

    It's not super well documented, but I hosted the Python scripts I used to download the data in the accompanying GitHub repository: https://github.com/ali-ramadhan/timescaledb-insert-benchmark... (a minimal CDS API request sketch follows this comment).
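
    A minimal CDS API request sketch, assuming the cdsapi package and an API key configured in ~/.cdsapirc; the variable and date selection are only examples, not the exact requests made by the repository's scripts, and the newer CDS service may use slightly different request keys:

    ```python
    # Sketch: download one month of ERA5 2 m temperature as NetCDF via the
    # classic CDS API. Requires the cdsapi package and a ~/.cdsapirc key.
    import cdsapi

    client = cdsapi.Client()
    client.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": "2m_temperature",
            "year": "2020",
            "month": "01",
            "day": [f"{d:02d}" for d in range(1, 32)],
            "time": [f"{h:02d}:00" for h in range(24)],
            "format": "netcdf",
        },
        "era5_t2m_2020-01.nc",
    )
    ```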

