Great post! Hi Ali! I think what's missing here is an analysis of what is gained...

PolarizedPoutin · on April 16, 2024

Hey Ryan and thank you for the feedback!

I agree that storing the data is appropriately chunked Zarr files is almost surely going to be faster, simpler to set up, and take up less space. Could even put up an API in front of it to get "queries".

I also agree that I haven't motivated the RDBMS approach much. This is mainly because I took this approach with Postgres + Timescale since I wanted to learn to work with them, and playing around with ERA5 data seemed like the most fun way. Maybe it's the allure of weather data being big enough to pose a challenge here.

I don't have anything to back this up but I wonder if the RDBMS approach, with properly tuned and indexed TimescaleDB + PostGIS (non-trivial to set up), can speed up complex spatio-temporal queries, e.g. computing the 99th percentile of summer temperatures in Chile from 1940-1980, in case many different Zarr chunks have to be read to find this data. I like the idea of setting up different tables to cache these kinds of statistics, but it's not that hard to do with Zarr either.

I'm benchmarking queries and indexes next so I might know more then!

ohmahjong · on April 16, 2024

This is a bit off-topic but I'm interested in the same space you are in.

There seems to be an inherent pull between large chunks (great for visualising large extents and larger queries) vs smaller chunks for point-based or timeseries queries. It's possible but not very cost-effective to maintain separately-chunked versions of these large geospatial datasets. I have heard of "kerchunk" being used to try and get the best of both, but then I _think_ you lose out on the option of compressing the data and it introduces quite a lot of complexity.

What are your thoughts on how to strike that balance between use cases?

rabernat · on April 16, 2024

> It's possible but not very cost-effective to maintain separately-chunked versions of these large geospatial datasets.

Like all things in tech, it's about tradeoffs. S3 storage costs about $275 TB a year. Typical weather datasets are ~10 TB. If you're running a business that uses weather data in operations to make money, you could easily afford to make 2-3 copies that are optimized for different query patterns. We see many teams doing this today in production. That's still much cheaper (and more flexible) than putting the same volume of data in a RDBMS, given the relative cost of S3 vs. persistent disks.

The real hidden costs of all of these solutions is the developer time operating the data pipelines for the transformation.

ohmahjong · on April 17, 2024

That's a great point, it really is all about tradeoffs. In my use case there is strong motivation to keep data creation times low, so writing out multiple datasets comes at a product/opportunity cost moreso than a storage cost. Thanks for the insight.