
I’m working on getting an interesting derived dataset into Sentinel Hub. It’s distributed as 2600+ non-COG GeoTIFFs, totaling 26 GB, but only 834 MB zipped up. So it’s not well compressed - if I reprocess those with ‘deflate’ compression the total comes down to 11 GB. See https://zenodo.org/records/10907151 for the data.
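For reference, a minimal sketch of the kind of recompression described above, using GDAL’s gdal_translate (the directory names and the loop are illustrative, not part of my actual workflow):

```python
# Recompress plain GeoTIFFs with DEFLATE using gdal_translate.
# Assumes GDAL is installed and on PATH; paths are illustrative.
import subprocess
from pathlib import Path

src_dir = Path("tiffs")          # hypothetical input directory
dst_dir = Path("tiffs_deflate")  # hypothetical output directory
dst_dir.mkdir(exist_ok=True)

for src in src_dir.glob("*.tif"):
    dst = dst_dir / src.name
    subprocess.run(
        [
            "gdal_translate",
            "-co", "COMPRESS=DEFLATE",  # lossless compression
            "-co", "TILED=YES",         # internal tiling helps later COG steps
            str(src), str(dst),
        ],
        check=True,
    )
```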

I’m working with a few of these types of datasets, some with more data, and am mostly just wondering how I should reason about putting it all in one COG vs. many small files / COGs that I’d register as ‘tiles’.

At one extreme I could take these 2600 GeoTIFFs, turn them into COGs, and make each of them a ‘tile’? But it seems like that probably won’t work all that well, as there’s no dataset-wide ‘overview’ (unless SH generates something like that?). At the other end I could try to put them all into a single COG, which maybe ends up at 15 GB or so? With overviews it seems like it should perform decently, and it’s easier conceptually.

But then I’m curious if there’s an upper limit - if I have 8-band global data at 3 meters, should I make a 300 GB COG? Or is it better practice to break it up in some way?

I haven’t yet managed to get byoc-tool working, so perhaps there’s some advice embedded in using that, but I couldn’t find anything online for this question. I may well be misunderstanding something, but I’m just looking for advice on how to process my GeoTIFFs to work well with BYOC.

thanks!


Hi @cholmes!

I cannot answer your question in its entirety, but perhaps I can share some experience related to the issue at hand.

It’s true that calculating overviews on the small tiles won’t really help, so it’s a good idea to merge them into larger tiles. I’m not aware of any guidelines, but in our team (dotAI) we tend to merge pipeline results (10 km x 10 km tiles at 10 m resolution) into UTM grids, which we then COGify. This has worked well for the past few years. If your data is too dense, you could also use the MGRS grid, which offers a bit higher granularity than the UTM grid. Looking at the provided data, perhaps the naming of the files already offers some hierarchical grouping, which could be used for this purpose?

I can also point you to our Python utilities for creating COG overviews here: https://hello.planet.com/code/eo/code/eo-grow/-/blob/develop/eogrow/utils/map.py?ref_type=heads#L84-L147. These will create all overviews up to the specified block size. You can also try the 2048 block size. This utility has been tested successfully with SH ingestion many times.
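For illustration, here is a minimal sketch of the same idea using rasterio (this is not the linked eo-grow utility itself): keep adding power-of-two overview levels until the coarsest level fits within a single block. The default block size and the file name are assumptions.

```python
# Build overview levels until the coarsest one fits inside a single block.
# A sketch of the approach, not the linked utility; assumes rasterio is installed.
import rasterio
from rasterio.enums import Resampling

def build_overviews(path: str, blocksize: int = 512) -> None:
    with rasterio.open(path, "r+") as ds:
        factors = []
        factor = 2
        # Keep halving until the largest dimension fits in one block.
        while max(ds.width, ds.height) // factor > blocksize:
            factors.append(factor)
            factor *= 2
        factors.append(factor)  # final level fits within a single block
        ds.build_overviews(factors, Resampling.average)

# Hypothetical usage on a merged tile:
build_overviews("merged_tile.tif", blocksize=1024)
```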


Hopefully some of this helps.

Cheers!


SH doesn’t generate a dataset-wide overview.


Regarding "putting it all in one COG vs many small files / COGs" and viewing them at dataset-wide resolution: there are two limits to keep in mind.


The first is the SH TIFF header size limit, which caps TIFF headers at 1 MB. This limits how much data (how many internal TIFF tiles/blocks) you can have per COG. This limit is listed under the BYOC constraints here.
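As a rough back-of-envelope sketch (my own numbers, not from the docs): each internal tile contributes an entry to the TileOffsets and TileByteCounts arrays in the header, so assuming roughly 16 bytes of header per tile, a 1 MB header allows on the order of 65,000 internal tiles:

```python
# Rough estimate of how large a COG can be under a 1 MB header limit.
# The ~16 bytes/tile figure (8-byte offset + 8-byte byte count, BigTIFF-style)
# is an assumption; the real budget also depends on other tags, bands,
# masks, and overview levels, which all add header entries.
HEADER_LIMIT = 1 * 1024 * 1024   # 1 MB, per the BYOC constraints
BYTES_PER_TILE_ENTRY = 16        # assumed offset + byte-count per internal tile
BLOCKSIZE = 1024                 # assumed internal tile size in pixels

max_tiles = HEADER_LIMIT // BYTES_PER_TILE_ENTRY   # ~65,536 tiles
max_pixels = max_tiles * BLOCKSIZE * BLOCKSIZE     # ~68.7 gigapixels
side = int(max_pixels ** 0.5)                      # if the raster were square

print(f"~{max_tiles:,} tiles -> ~{max_pixels/1e9:.1f} Gpx (~{side:,} px square)")
```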


The second is the SH request (output) resolution limit, which is calculated from the last (coarsest) COG overview level of your tiles, according to the formula here. So if you want to process data at dataset-wide resolution, your data needs to support this.


Regarding byoc-tool: there’s no advice embedded in it on this question.


To add to my previous post: I would first think about what resolution I want to process my data at. Then I would try to create COGs that support this. If such COGs would be too big (i.e. they would hit the SH header size limit), then there’s no way to process the data at the desired resolution.


I gave a wrong link to the BYOC constraints. It’s https://docs.sentinel-hub.com/api/latest/api/byoc/#constraints-and-settings.


Great, this is helpful, thanks all!

I do think it’s generally a goal for me to have the overviews work ‘globally’ - it’s always a bummer when I get the message that SH can’t display the image until I zoom in. Often I don’t know exactly where to zoom in, so I’m left blindly trying different areas, or trying to look at the data elsewhere.

It does seem like the general advice leans towards ‘make your COGs as large as possible, but be sure to fit within the 1 MB header constraint’.

Following UTM grids seems reasonable, and I also heard from another team that 10-degree grids worked well for exposing global datasets at full resolution, with overviews that work at global scale. I’ll try these out, but it does seem like this would be a good set of recommendations to write up.

It could also be nice to work towards a tool that could take a set of GeoTIFF inputs and properly merge / retile / COGify them into more ideal sizes. There are good recommendations on making a single COG, but I found less on how to best tile continental- or global-scale datasets. But until then this thread should help.
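In that spirit, a minimal sketch of such a merge/retile/COGify step using GDAL’s command-line tools (gdalbuildvrt plus the COG driver); the 10-degree grid, EPSG:4326 inputs, and file names are assumptions for illustration:

```python
# Merge many small GeoTIFFs into a VRT, then cut COGs on a coarse grid.
# Sketch only: 10-degree cells, geographic coordinates, and names are assumed.
import subprocess
from pathlib import Path

tiffs = sorted(str(p) for p in Path("tiffs_deflate").glob("*.tif"))
subprocess.run(["gdalbuildvrt", "mosaic.vrt", *tiffs], check=True)

CELL = 10  # grid cell size in degrees (assumed)
for lon in range(-180, 180, CELL):
    for lat in range(-90, 90, CELL):
        out = f"cog_{lon}_{lat}.tif"
        # Cells entirely outside the data extent would need to be skipped;
        # that check is omitted here for brevity.
        subprocess.run(
            [
                "gdal_translate", "-of", "COG",
                "-projwin", str(lon), str(lat + CELL), str(lon + CELL), str(lat),
                "-co", "COMPRESS=DEFLATE",
                "-co", "BLOCKSIZE=1024",
                "mosaic.vrt", out,
            ],
            check=True,
        )
```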


This is interesting. We had a somewhat similar problem when generating global Sentinel-2 cloudless mosaics for CDSE. The requirements relevant to this discussion were:

  • keep original Sentinel-2 spatial resolution (i.e. 10m)
  • the tiling should preferably follow the Sentinel-2 tiling
  • it should be possible to display/see the mosaic for the whole world at once

To fulfill the first two requirements, we generated mosaics at 10 m resolution using the UTM grid with 100 km x 100 km tiles. These were ingested into a BYOC collection, but we could not use this collection to display the world-wide mosaic with SH. We then processed so-called “low resolution” mosaics, where we basically down-sampled the original mosaic and created another set of COGs (with overviews). The down-sampled mosaics were ingested as a separate BYOC collection, which is now used to display the mosaics at smaller scales. The Copernicus Browser handles switching between both collections when users zoom in and out, so the final user experience is quite good imo.
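For context, the down-sampling step described above could look roughly like this with gdalwarp; the 160 m target resolution, the metric CRS, and the file names are made up for the example, and the actual CDSE pipeline may well differ:

```python
# Down-sample a 10 m mosaic into a coarser "low resolution" COG.
# Sketch only: target resolution, CRS units, and file names are assumptions.
import subprocess

subprocess.run(
    [
        "gdalwarp",
        "-tr", "160", "160",   # target resolution in CRS units (assumed metric)
        "-r", "average",       # averaging resampler for down-sampling
        "-of", "COG",
        "-co", "COMPRESS=DEFLATE",
        "mosaic_10m.tif", "mosaic_lowres.tif",
    ],
    check=True,
)
```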

However, the whole process of getting there is relatively complicated for the data producer/owner: the producer has to run a separate process to down-sample the data and ingest it into two separate BYOC collections, and some external process then needs to handle transitions between both collections. We are thinking about how to improve this but have no concrete plan yet. Any feedback or use case like the one you described above is helpful and much welcome.

