Calculating percentile for each gridpoint in xarray

你离开我真会死。 Submitted on 2021-02-17 06:03:56

Question


I am currently using xarray to make probability maps. I want to use a statistical assessment like a "counting" exercise: for all data points in NEU, count how many times both variables jointly exceed their thresholds, that is, the 1st percentile of the precipitation data and the 99th percentile of the temperature data. The probability (P) of joint occurrence is then simply the number of joint exceedances divided by the number of data points in the dataset.
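For a single gridpoint, this counting exercise might be sketched as follows (a minimal sketch on synthetic data, assuming "exceed" here means precipitation below its 1st percentile and temperature above its 99th):

```python
import numpy as np

rng = np.random.default_rng(0)
precip = rng.standard_normal(1000)  # synthetic precipitation series at one gridpoint
temp = rng.standard_normal(1000)    # synthetic temperature series at one gridpoint

# Thresholds: 1st percentile of precipitation, 99th percentile of temperature
p_thr = np.nanpercentile(precip, 1)
t_thr = np.nanpercentile(temp, 99)

# Joint exceedance count divided by the number of data points
joint = (precip < p_thr) & (temp > t_thr)
P = joint.sum() / precip.size
```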

<xarray.Dataset>
Dimensions:    (latitude: 88, longitude: 200, time: 6348)
Coordinates:
  * latitude   (latitude) float64 49.62 49.88 50.12 50.38 ... 70.88 71.12 71.38
  * longitude  (longitude) float64 -9.875 -9.625 -9.375 ... 39.38 39.62 39.88
  * time       (time) datetime64[ns] 1950-06-01 1950-06-02 ... 2018-08-31
Data variables:
    rr         (time, latitude, longitude) float32 dask.array<chunksize=(6348, 88, 200), meta=np.ndarray>
    tx         (time, latitude, longitude) float32 dask.array<chunksize=(6348, 88, 200), meta=np.ndarray>
    Ellipsis   float64 0.0

I want to calculate the percentile of both precipitation and temperature for each gridpoint, that means basically that I want to repeat the function below for every gridpoint.

Neu_Percentile = np.nanpercentile(NEU.rr[:, 0, 0], 1)

Can anyone help me out with this problem? I also tried to use xr.apply_ufunc, but unfortunately it didn't work out well.


Answer 1:


I'm not sure exactly how you want to process the quantiles, but here is a version you should be able to adapt.

Also, I chose to keep the dataset structure when computing the quantiles, as it shows how to retrieve the values of the outliers if this is ever relevant (and it is one step away from retrieving the values of valid data points, which is likely relevant).

1. Create some data

import numpy as np
import xarray as xr

coords = ("time", "latitude", "longitude")
sizes = (500, 80, 120)

ds = xr.Dataset(
    coords={c: np.arange(s) for c, s in zip(coords, sizes)},
    data_vars=dict(
        precipitation=(coords, np.random.randn(*sizes)),
        temperature=(coords, np.random.randn(*sizes)),
    ),
)

View of the data:

<xarray.Dataset>
Dimensions:        (latitude: 80, longitude: 120, time: 500)
Coordinates:
  * time           (time) int64 0 1 2 3 ... 496 497 498 499
  * latitude       (latitude) int64 0 1 2 3 ... 76 77 78 79
  * longitude      (longitude) int64 0 1 2 3 ... 117 118 119
Data variables:
    precipitation  (time, latitude, longitude) float64 -1.673 ... -0.3323
    temperature    (time, latitude, longitude) float64 -0.331 ... -0.03728

2. Compute quantiles

qt_dims = ("latitude", "longitude")
qt_values = (0.1, 0.9)

ds_qt = ds.quantile(qt_values, dim=qt_dims)

The result is a Dataset in which the dimensions of analysis ("latitude", "longitude") have been reduced away and a new "quantile" dimension added:

<xarray.Dataset>
Dimensions:        (quantile: 2, time: 500)
Coordinates:
  * time           (time) int64 0 1 2 3 ... 496 497 498 499
  * quantile       (quantile) float64 0.1 0.9
Data variables:
    precipitation  (quantile, time) float64 -1.305 ... 1.264
    temperature    (quantile, time) float64 -1.267 ... 1.254

3. Compute outliers co-occurrence

For the locations of outliers (edit: using np.logical_and, which is more readable than the & operator):

da_outliers_loc = np.logical_and(
    ds.precipitation > ds_qt.precipitation.sel(quantile=qt_values[0]),
    ds.temperature > ds_qt.temperature.sel(quantile=qt_values[1]),
)

The output is a boolean DataArray:

<xarray.DataArray (time: 500, latitude: 80, longitude: 120)>
array([[[False, ...]]])
Coordinates:
  * time       (time) int64 0 1 2 3 4 ... 496 497 498 499
  * latitude   (latitude) int64 0 1 2 3 4 ... 75 76 77 78 79
  * longitude  (longitude) int64 0 1 2 3 ... 116 117 118 119

And if ever the values are relevant:

ds_outliers = ds.where(
    (ds.precipitation > ds_qt.precipitation.sel(quantile=qt_values[0]))
    & (ds.temperature > ds_qt.temperature.sel(quantile=qt_values[1]))
)

4. Count outliers per timestep

outliers_count = da_outliers_loc.sum(dim=qt_dims)

Finally, here is a DataArray with only a time dimension, whose values are the number of outliers at each timestamp.

<xarray.DataArray (time: 500)>
array([857, ...])
Coordinates:
  * time     (time) int64 0 1 2 3 4 ... 495 496 497 498 499
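The question ultimately asks for a per-gridpoint probability, which corresponds to taking the quantiles over time instead of over space. A self-contained sketch of that variant (synthetic data as above, assuming the joint event is precipitation below its 1st percentile and temperature above its 99th):

```python
import numpy as np
import xarray as xr

coords = ("time", "latitude", "longitude")
sizes = (500, 8, 12)
rng = np.random.default_rng(0)

ds = xr.Dataset(
    coords={c: np.arange(s) for c, s in zip(coords, sizes)},
    data_vars=dict(
        precipitation=(coords, rng.standard_normal(sizes)),
        temperature=(coords, rng.standard_normal(sizes)),
    ),
)

# Per-gridpoint thresholds: quantiles along time only
ds_qt = ds.quantile((0.01, 0.99), dim="time")

# Joint exceedance at each gridpoint and timestep
joint = np.logical_and(
    ds.precipitation < ds_qt.precipitation.sel(quantile=0.01),
    ds.temperature > ds_qt.temperature.sel(quantile=0.99),
)

# Probability map: joint exceedances divided by the number of timesteps
P = joint.sum(dim="time") / joint.sizes["time"]
```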



Answer 2:


np.nanpercentile works on a flattened array by default; here, however, the goal is to reduce only the first dimension, producing a 2D array that contains the result at each gridpoint. To achieve this, use the axis argument of nanpercentile:

np.nanpercentile(NEU.rr, 1, axis=0)
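As a quick self-contained check of the axis behavior (a synthetic array standing in for NEU.rr):

```python
import numpy as np

# Synthetic stand-in for NEU.rr with dims (time, latitude, longitude)
arr = np.random.default_rng(0).standard_normal((100, 3, 4))
arr[0, 0, 0] = np.nan  # nanpercentile ignores NaNs

# Reduce only the time axis (axis=0): one value per gridpoint
out = np.nanpercentile(arr, 1, axis=0)
print(out.shape)  # (3, 4)
```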

This, however, removes the labeled dimensions and coordinates. To preserve the dims and coords, apply_ufunc has to be used (note that it does not vectorize the function for you):

xr.apply_ufunc(
    lambda x: np.nanpercentile(x, 1, axis=-1), NEU.rr, input_core_dims=[["time"]]
)

Note how the axis is now -1 and we use input_core_dims, which tells apply_ufunc that this dimension will be reduced, and also moves it to the last position (hence the -1). For a more detailed explanation of apply_ufunc, this other answer may help.
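A minimal self-contained version of this call on synthetic data (hypothetical small sizes; for a dask-backed array like NEU.rr, passing dask="parallelized" and output_dtypes=[float] to apply_ufunc may also be needed):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for NEU.rr with dims (time, latitude, longitude)
rr = xr.DataArray(
    np.random.default_rng(0).standard_normal((365, 4, 5)),
    dims=("time", "latitude", "longitude"),
)

# "time" is declared a core dim: it is moved last (hence axis=-1) and reduced
pct = xr.apply_ufunc(
    lambda x: np.nanpercentile(x, 1, axis=-1),
    rr,
    input_core_dims=[["time"]],
)
# pct keeps latitude/longitude and drops time
```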



Source: https://stackoverflow.com/questions/62698837/calculating-percentile-for-each-gridpoint-in-xarray
