Faster reading of time series from netCDF?

為{幸葍}努か 提交于 2019-11-30 05:17:12

I think the answer to this problem lies less in re-ordering the data than in chunking it. For a full discussion of the implications of chunking netCDF files, see the blog posts on chunking by Russ Rew, lead netCDF developer at Unidata.

The upshot is that while employing different chunking strategies can achieve large increases in access speed, choosing the right strategy is non-trivial.

On the smaller sample dataset, sst.wkmean.1990-present.nc, I saw the following results when using your benchmark command:

1) Unchunked:

##         test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk         1000   0.841    1.000     0.812    0.029          0         0
## 1 timeseries         1000   1.325    1.576     0.944    0.381          0         0

2) Naively Chunked:

##         test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk         1000   0.788    1.000     0.788    0.000          0         0
## 1 timeseries         1000   0.814    1.033     0.814    0.001          0         0

The naive chunking was simply a shot in the dark; I used the nccopy utility thusly:

$ nccopy -c"lat/100,lon/100,time/100,nbnds/" sst.wkmean.1990-present.nc chunked.nc

The Unidata documentation for the nccopy utility can be found here.
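One rough way to reason about a chunk shape is to count how many chunks a given read touches, since chunked storage is read a whole chunk at a time. The sketch below does that arithmetic for the two benchmark reads, assuming a chunk length of 100 along lon, lat and time as in the nccopy call above. The helper name `chunks_touched` is made up for illustration, and the count says nothing about cache or compression effects.

```shell
#!/bin/sh
# chunks_touched START COUNT CHUNK
# Number of chunks a read covers along one dimension.
# START is 1-based (as in ncdf4's ncvar_get); CHUNK is the chunk length.
chunks_touched() {
    first=$(( ($1 - 1) / $3 ))            # index of the first chunk hit
    last=$(( ($1 - 1 + $2 - 1) / $3 ))    # index of the last chunk hit
    echo $(( last - first + 1 ))
}

# Assumed chunk length: 100 along each of lon, lat and time.
# Time-series read: start = c(1, 1, 50), count = c(10, 10, 100)
ts=$(( $(chunks_touched 1 10 100) * $(chunks_touched 1 10 100) * $(chunks_touched 50 100 100) ))
# Spatial read: start = c(1, 1, 50), count = c(100, 100, 1)
sp=$(( $(chunks_touched 1 100 100) * $(chunks_touched 1 100 100) * $(chunks_touched 50 1 100) ))

echo "time series read touches $ts chunk(s)"
echo "spatial read touches $sp chunk(s)"
```

Under these assumptions the time-series read spans two chunks (time steps 50 to 149 cross a chunk boundary) and the spatial read spans one, which is in line with the two timings ending up close together.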

I wish I could recommend a particular strategy for chunking your data set, but it is highly dependent on the data. Hopefully the articles linked above will give you some insight into how you might chunk your data to achieve the results you're looking for!

Update

A blog post by Marcos Hermida shows how different chunking strategies affected the speed of reading a time series from one particular netCDF file. Treat it only as a jumping-off point.

Regarding rechunking via nccopy apparently hanging: the issue appears to be related to the default chunk cache size of 4MB. Increasing it to 4GB (or more) can reduce the copy time for a large file from over 24 hours to under 11 minutes!

One point I'm not sure about: in the first link the discussion concerns the chunk cache, but the argument passed to nccopy, -m, specifies the number of bytes in the copy buffer. It is the -h argument to nccopy that controls the size of the chunk cache.
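As a sketch of how the two options might be combined when rechunking: in the nccopy man page, as far as I know, -h sets the chunk cache size and -m sets the copy buffer size, both in bytes with K/M/G suffixes. The file names and chunk spec are reused from the earlier call; the 4G cache follows the figure mentioned above, and the 1G buffer is purely an illustrative assumption. Check `man nccopy` on your installation before relying on these values.

```shell
#!/bin/sh
# Sketch only: sizes and file names are illustrative, not recommendations.
in_file="sst.wkmean.1990-present.nc"
out_file="chunked.nc"
chunks="lat/100,lon/100,time/100,nbnds/"   # chunk spec from the earlier nccopy call
cache="4G"    # -h: chunk cache size in bytes
buffer="1G"   # -m: copy buffer size in bytes (assumed value)

cmd="nccopy -h $cache -m $buffer -c $chunks $in_file $out_file"
echo "$cmd"

# Run the copy only when nccopy is installed and the input file exists.
if command -v nccopy >/dev/null 2>&1 && [ -f "$in_file" ]; then
    nccopy -h "$cache" -m "$buffer" -c "$chunks" "$in_file" "$out_file"
fi
```

Printing the command first makes it easy to confirm which option carries which size before starting a long copy.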

EDIT: The original question had a mistake, but the two reads may also have different start-up overheads, so it is fair to run multiple replications; rbenchmark makes that easy.

The example file is rather large, so I've used a smaller one. Can you make the same comparison with your file?

More accessible example file: ftp://ftp.cdc.noaa.gov/Datasets/noaa.oisst.v2/sst.wkmean.1990-present.nc

I get more like twice the time taken for a time series:

library(ncdf4)
library(rbenchmark)

nc <- nc_open("sst.wkmean.1990-present.nc")

# Both reads fetch 10,000 values:
#   timeseries: 10 x 10 cells over 100 time steps
#   spacechunk: 100 x 100 cells at a single time step
benchmark(
  timeseries = ncvar_get(nc, "sst", start = c(1, 1, 50), count = c(10, 10, 100)),
  spacechunk = ncvar_get(nc, "sst", start = c(1, 1, 50), count = c(100, 100, 1)),
  replications = 1000
)
##         test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk         1000    0.47    1.000      0.43     0.03         NA        NA
## 1 timeseries         1000    1.04    2.213      0.58     0.47         NA        NA

nc_close(nc)

Have you considered using cdo to extract the point?

cdo remapnn,lon=x/lat=y in.nc point.nc 

CDO sometimes runs out of memory. If that happens, you may need to loop over the yearly files and then concatenate the separate point files with

cdo mergetime point_????.nc point_series.nc
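The yearly loop described above might look like the sketch below. The point coordinates, the year range and the input naming scheme `in_YYYY.nc` are all hypothetical placeholders, and `point_cmd` is a helper invented here to make the per-year command easy to inspect.

```shell
#!/bin/sh
# Hypothetical values: coordinates, year range and file names (in_YYYY.nc)
# are placeholders for illustration only.
lon=10.5
lat=45.0
first_year=1990
last_year=1992

# Build the per-year extraction command as a string so it can be inspected.
point_cmd() {
    echo "cdo remapnn,lon=${lon}/lat=${lat} in_$1.nc point_$1.nc"
}

yyyy=$first_year
while [ "$yyyy" -le "$last_year" ]; do
    point_cmd "$yyyy"    # show what would run for this year
    # Execute only when cdo is installed and the yearly file exists.
    if command -v cdo >/dev/null 2>&1 && [ -f "in_${yyyy}.nc" ]; then
        cdo remapnn,lon=${lon}/lat=${lat} "in_${yyyy}.nc" "point_${yyyy}.nc"
    fi
    yyyy=$(( yyyy + 1 ))
done

# Concatenate the yearly point files along time, if any were produced.
if command -v cdo >/dev/null 2>&1 && ls point_????.nc >/dev/null 2>&1; then
    cdo mergetime point_????.nc point_series.nc
fi
```

Processing one year at a time keeps each cdo call small, which is the point of the workaround when a single call on the full record runs out of memory.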