Loading a trillion rows of weather data into TimescaleDB

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. arco-era5

    Recipes for reproducing Analysis-Ready & Cloud Optimized (ARCO) ERA5 datasets.

    Why?

    Most weather and climate datasets, including ERA5, are highly structured on regular latitude-longitude grids. Even if you were only doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy the dataset's structure unless your only goal is to extract point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!

    The vast majority of research and operational applications of datasets like ERA5 are probably better served by cloud-optimized replicas of the original dataset, such as ARCO-ERA5 published on the Google Public Datasets program [1]. These versions of the dataset preserve the original structure and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost every case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this is more than performant enough for the majority of use cases that one might care about.

    [1]: https://cloud.google.com/storage/docs/public-datasets/era5
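
    As a rough illustration of what "plucking" a point timeseries from a regular latitude-longitude grid involves, here is a minimal sketch in plain Python. The 0.25-degree spacing, the north-to-south latitude ordering, and the 0 to 359.75 longitude range are assumptions about how ERA5 is commonly distributed, and `nearest_grid_index` and `point_timeseries` are hypothetical helper names, not part of any library mentioned above.

```python
# Assumed grid layout (not confirmed by the comment above): a regular
# 0.25-degree grid, latitude from 90 down to -90, longitude from 0 to 359.75.
GRID_STEP_DEG = 0.25
N_LAT = 721   # 180 / 0.25 + 1 latitude rows
N_LON = 1440  # 360 / 0.25 longitude columns


def nearest_grid_index(lat, lon):
    """Return (lat_index, lon_index) of the grid point nearest to (lat, lon)."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be within [-90, 90]")
    lat_index = round((90.0 - lat) / GRID_STEP_DEG)
    # Wrap longitude into [0, 360) and wrap the index at the dateline seam.
    lon_index = round((lon % 360.0) / GRID_STEP_DEG) % N_LON
    return lat_index, lon_index


def point_timeseries(cube, lat, lon):
    """Extract one location's series from data indexed as cube[time][lat][lon]."""
    i, j = nearest_grid_index(lat, lon)
    return [time_slice[i][j] for time_slice in cube]
```

    Every point extraction touches one value per time step, which is why the chunk layout of the archive (many time steps per chunk versus whole maps per chunk) matters so much for this access pattern.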

  3. proton

    High-performance, low-footprint SQL database written in C++. Process millions of rows per second from Kafka, Pulsar, or ClickHouse, and seamlessly write results back. Supports powerful features like JOIN, CDC, UPSERT, and LOOKUP, enabling real-time analytics and ETL at scale. (by timeplus-io)

    What's the process for adding support for other databases to your tool qStudio?

    I'm thinking perhaps you could add support for Timeplus [1]? Timeplus is a streaming-first database built on ClickHouse. The core DB engine Timeplus Proton is open source [2].

    It seems that qStudio is open source [3] and written in Java and will need a JDBC driver to add support for a new RDBMS? If yes, Timeplus Proton has an open source JDBC driver [4] based on ClickHouse's driver but with modifications added for streaming use cases.

    1: https://www.timeplus.com/

    2: https://github.com/timeplus-io/proton

    3: https://github.com/timeseries/qstudio

    4: https://github.com/timeplus-io/proton-java-driver

  4. qstudio

    qStudio - Free SQL Analysis Tool

  5. proton-java-driver

    JDBC driver for Timeplus Proton

  6. ClickBench

    ClickBench: a Benchmark For Analytical Databases

  7. open-data

    Open-Meteo on AWS Open Data (by open-meteo)

    Creator of Open-Meteo here. There is a small tutorial to set up ERA5 locally: https://github.com/open-meteo/open-data/tree/main/tutorial_d...

    Under the hood Open-Meteo is using a custom file format with time-series chunking and specialised compression for low-frequency weather data. General purpose time-series databases do not even get close to this setup.
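
    To give a feel for why chunking by time and compressing slowly varying values works well, here is a generic sketch using only the Python standard library. It is not Open-Meteo's actual file format: the scale factor, the delta encoding, and the use of zlib are all assumptions chosen for illustration, and `encode` and `decode` are hypothetical names.

```python
import struct
import zlib

SCALE = 20  # hypothetical precision: values are stored in steps of 0.05


def encode(values, scale=SCALE):
    """Quantize to integers, delta-encode, then compress one time-series chunk."""
    if not values:
        return zlib.compress(b"")
    quantized = [round(v * scale) for v in values]
    # Neighbouring time steps are similar, so differences are small integers.
    deltas = [quantized[0]] + [b - a for a, b in zip(quantized, quantized[1:])]
    raw = struct.pack(f"<{len(deltas)}i", *deltas)
    return zlib.compress(raw, 9)


def decode(blob, scale=SCALE):
    """Invert encode(): decompress, undo the deltas, and rescale to floats."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}i", raw)
    values, running = [], 0
    for delta in deltas:
        running += delta
        values.append(running / scale)
    return values
```

    The round trip is lossy by design: each decoded value is within half a quantization step (0.5 / SCALE) of the original, which is the usual trade-off when storage precision is matched to measurement precision.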

  8. timescaledb-insert-benchmarks

    Benchmarking inserting a ~trillion rows of weather data into TimescaleDB

    The full dataset is huge (~9 petabytes and growing), of which I'm using just ~8 terabytes. That is still quite big to upload.

    The data is freely available from the [Climate Change Service](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...) which has a nice API but download speeds can be a bit slow.

    [NCAR's Research Data Archive](https://rda.ucar.edu/datasets/ds633-0/) provides some of the data (as pre-generated NetCDF files) but at higher download speeds.

    It's not super well documented but I hosted the Python scripts I used to download the data on the accompanying GitHub repository: https://github.com/ali-ramadhan/timescaledb-insert-benchmark...

  10. timescaledb-insert-benchmark

    Discontinued: the GitHub repository ali-ramadhan/timescaledb-insert-benchmark returns 404 Not Found.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
