timescaledb-insert-benchmarks
arco-era5
Metric | timescaledb-insert-benchmarks | arco-era5
---|---|---
Mentions | 1 | 5
Stars | 14 | 179
Growth | - | 6.7%
Activity | 8.8 | 5.9
Last commit | about 2 months ago | 21 days ago
Language | Python | Python
License | MIT License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
timescaledb-insert-benchmarks
- Loading a trillion rows of weather data into TimescaleDB
The full dataset is huge (~9 petabytes and growing), of which I'm using just ~8 terabytes. That is still a lot to upload.
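To put those sizes in perspective, here is a minimal back-of-the-envelope sketch. The two sizes are the ones quoted above; the transfer rates are purely hypothetical assumptions (the post gives no bandwidth figures), and decimal units are assumed (1 TB = 10**12 bytes).

```python
# Sizes quoted in the post, assuming decimal units (1 TB = 10**12 bytes).
FULL_DATASET_BYTES = 9 * 10**15  # ~9 petabytes
SUBSET_BYTES = 8 * 10**12        # ~8 terabytes

# Hypothetical sustained transfer rates in bytes per second.
# These are assumptions for illustration, not measurements from the post.
ASSUMED_RATES = {
    "10 MB/s": 10 * 10**6,
    "100 MB/s": 100 * 10**6,
}


def subset_fraction(subset_bytes, full_bytes):
    """Return the subset size as a fraction of the full dataset."""
    return subset_bytes / full_bytes


def transfer_hours(n_bytes, bytes_per_second):
    """Return the hours needed to move n_bytes at a constant rate."""
    return n_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    fraction = subset_fraction(SUBSET_BYTES, FULL_DATASET_BYTES)
    print(f"Subset is {fraction:.3%} of the full dataset")
    for label, rate in ASSUMED_RATES.items():
        hours = transfer_hours(SUBSET_BYTES, rate)
        print(f"At {label}: about {hours:.0f} hours")
```

Under these assumptions the ~8 TB subset is under 0.1% of the full archive, yet it still takes about 22 hours to move at a sustained 100 MB/s.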
The data is freely available from the [Climate Change Service](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...) which has a nice API but download speeds can be a bit slow.
[NCAR's Research Data Archive](https://rda.ucar.edu/datasets/ds633-0/) provides some of the data (as pre-generated NetCDF files) but at higher download speeds.
They're not super well documented, but I hosted the Python scripts I used to download the data in the accompanying GitHub repository: https://github.com/ali-ramadhan/timescaledb-insert-benchmark...
arco-era5
- Loading a trillion rows of weather data into TimescaleDB
Why?
Most weather and climate datasets - including ERA5 - are highly structured on regular latitude-longitude grids. Even if you were solely doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context. It makes very little sense to destroy that structure unless your only goal is to extract point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!
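As an illustration of what "plucking" a location from a regular grid involves, here is a minimal sketch. It assumes a 0.25 degree grid with latitudes stored north to south (90 to -90) and longitudes running east from 0 to 359.75. That layout and the function names are assumptions for illustration, so check the coordinate arrays of whichever copy of the data you use.

```python
def grid_shape(resolution=0.25):
    """Return (n_lat, n_lon) for a regular global grid that includes both poles."""
    n_lat = round(180 / resolution) + 1
    n_lon = round(360 / resolution)
    return n_lat, n_lon


def nearest_grid_index(lat, lon, resolution=0.25):
    """Return the (row, col) of the grid point nearest to (lat, lon).

    Assumed layout (not taken from the source): row 0 is latitude +90 and
    rows run southward; column 0 is longitude 0 and columns run eastward.
    Longitudes may be given as -180..180 or 0..360.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be between -90 and 90 degrees")
    _, n_lon = grid_shape(resolution)
    row = round((90.0 - lat) / resolution)
    # Wrap longitude into [0, 360) and wrap the column index, so that
    # 359.9 degrees maps back to column 0.
    col = round((lon % 360.0) / resolution) % n_lon
    return row, col


if __name__ == "__main__":
    n_lat, n_lon = grid_shape()
    print(f"{n_lat} x {n_lon} = {n_lat * n_lon} grid points")
    print(nearest_grid_index(43.47, -80.54))
```

Because the grid is regular, the lookup is plain arithmetic with no search or spatial index needed. That is part of the structure the comment argues is worth preserving.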
The vast majority of research and operational applications of datasets like ERA5 are probably better served by cloud-optimized replicas of the original dataset, such as ARCO-ERA5 published on the Google Public Datasets program [1]. These versions of the dataset preserve the original structure, and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost every case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this has been more than performant enough for the majority of use cases that one might care about.
[1]: https://cloud.google.com/storage/docs/public-datasets/era5
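The chunking trade-off described above can be made concrete by counting how many chunks a read has to touch. This is a minimal sketch under stated assumptions: the array shape (one year of hourly data on a 0.25 degree grid) and all three chunk shapes are hypothetical and are not ARCO-ERA5's actual chunking.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks intersected by a rectangular selection.

    selection: one (start, stop) half-open index range per dimension.
    chunk_shape: chunk length along each dimension.
    """
    count = 1
    for (start, stop), chunk in zip(selection, chunk_shape):
        if stop <= start:
            raise ValueError("each selection range must be non-empty")
        first_chunk = start // chunk
        last_chunk = (stop - 1) // chunk
        count *= last_chunk - first_chunk + 1
    return count


# Hypothetical array: (time, latitude, longitude).
N_TIME, N_LAT, N_LON = 8760, 721, 1440

# Two access patterns: a full-year series at one grid point, and one global map.
POINT_SERIES = [(0, N_TIME), (186, 187), (1118, 1119)]
GLOBAL_MAP = [(0, 1), (0, N_LAT), (0, N_LON)]

# Hypothetical chunk layouts (illustrative only).
LAYOUTS = {
    "map-friendly": (1, N_LAT, N_LON),
    "series-friendly": (N_TIME, 10, 10),
    "generic": (24, 73, 144),
}

if __name__ == "__main__":
    for name, chunk_shape in LAYOUTS.items():
        series = chunks_touched(POINT_SERIES, chunk_shape)
        one_map = chunks_touched(GLOBAL_MAP, chunk_shape)
        print(f"{name}: point series reads {series} chunks, global map reads {one_map}")
```

In this toy setup each specialized layout makes one access pattern cost a single chunk and the other thousands, while the middle-ground layout keeps both in the hundreds. That is one way to read the claim that a generically chunked archive is good enough for most uses.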
- GraphCast: AI model for faster and more accurate global weather forecasting
You can also get some of the historical data from here: https://cloud.google.com/storage/docs/public-datasets/era5 (if the official API is too slow).
To use the data live, I think you would need to get a license from ECMWF...
- Open-source could finally get the world’s microscopes speaking the same language
This article misses one of the coolest things about the Zarr format - that it's flexible enough that it's also becoming widely used in climate science.
In particular the Pangeo project (https://pangeo.io/architecture.html) uses large Zarr stores as a performant format in the cloud which we can analyse in parallel at scale using distributed computing frameworks like dask.
More and more climate science data is being made publicly available as Zarr in the cloud, often through open data partnerships with cloud providers, e.g. on AWS (https://aws.amazon.com/blogs/publicsector/decrease-geospatia...) and ERA5 on GCP (https://cloud.google.com/storage/docs/public-datasets/era5).
I personally think that the more that common tooling can be shared between scientific disciplines the better.
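The chunk-parallel analysis pattern mentioned above can be sketched with nothing but the Python standard library. This is a stand-in for illustration, not dask or Zarr code: a list plays the role of the chunked store, a thread pool plays the role of the distributed scheduler, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def chunk_partial(chunk):
    """Reduce one chunk to a (sum, count) pair, independently of other chunks."""
    return sum(chunk), len(chunk)


def parallel_mean(values, chunk_size, max_workers=4):
    """Compute a mean by reducing chunks in parallel, then combining the partials."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split_into_chunks(values, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(chunk_partial, chunks))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


if __name__ == "__main__":
    print(parallel_mean(list(range(10)), chunk_size=3))
```

The key property is that each chunk is reduced without reference to any other, so the work can be spread across as many workers as there are chunks. Frameworks like dask automate this scheduling for real chunked arrays.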
- Analysis-Ready, Cloud Optimized ERA5
What are some alternatives?
bioformats2raw - Bio-Formats image file format to raw format converter
zarr-python - An implementation of chunked, compressed, N-dimensional arrays for Python.
era5_in_gee - Functions and Python scripts to ingest ERA5 data into Google Earth Engine
ome-zarr-py - Implementation of next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
czifile - Read Carl Zeiss(r) Image (CZI) files
ai-models