How do I subdivide/refine a dimension in an xarray DataSet?

Submitted by 社会主义新天地 on 2021-02-17 01:59:33

Question


Summary: I have a dataset that is collected in such a way that the dimensions are not initially available. I would like to take what is essentially a big block of undifferentiated data and add dimensions to it so that it can be queried, subsetted, etc. That is the core of the following question.

Here is an xarray DataSet that I have:

<xarray.Dataset>
Dimensions:  (chain: 1, draw: 2000, rows: 24000)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 1993 1994 1995 1996 1997 1998 1999
  * rows     (rows) int64 0 1 2 3 4 5 6 ... 23994 23995 23996 23997 23998 23999
Data variables:
    obs      (chain, draw, rows) float64 4.304 3.985 4.612 ... 6.343 5.538 6.475
Attributes:
    created_at:                 2019-12-27T17:16:13.847972
    inference_library:          pymc3
    inference_library_version:  3.8

The rows dimension here corresponds to a number of subdimensions that I need to restore to the data. In particular, the 24,000 rows correspond to 100 samples each from 240 conditions (these 100 samples are in contiguous blocks). These conditions are combinations of gate, input, growth medium, and od.

I would like to end up with something like this:

<xarray.Dataset>
Dimensions:  (chain: 1, draw: 2000, gate: 1, input: 4, growth_medium: 3, sample: 100, rows: 24000)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 1993 1994 1995 1996 1997 1998 1999
  * rows     *MultiIndex*
  * gate     (gate) int64 'AND'
  * input    (input) int64 '00', '01', '10', '11'
  * growth_medium (growth_medium) 'standard', 'rich', 'slow'
  * sample   (sample) int64 0 1 2 3 4 5 6 7 ... 95 96 97 98 99
Data variables:
    obs      (chain, draw, gate, input, growth_medium, sample) float64 4.304 3.985 4.612 ... 6.343 5.538 6.475
Attributes:
    created_at:                 2019-12-27T17:16:13.847972
    inference_library:          pymc3
    inference_library_version:  3.8

I have a pandas dataframe that specifies the values of gate, input, and growth medium -- each row gives a set of values of gate, input, and growth medium, and an index that specifies where (in the rows) the corresponding set of 100 samples appears. The intent is that this data frame is a guide for labeling the Dataset.

I looked at the xarray docs on "Reshaping and Reorganizing Data", but I don't see how to combine those operations to do what I need. I suspect somehow I need to combine these with GroupBy, but I don't get how. Thanks!

Later: I have a solution to this problem, but it is so disgusting that I am hoping someone will explain how wrong I am, and what a more elegant approach is possible.

So, first, I extracted all the data in the original Dataset into raw numpy form:

import numpy as np
import xarray as xr
foo = qm.idata.posterior_predictive['obs'].squeeze('chain').values.T
foo.shape  # (24000, 2000)

Then I reshaped it as needed:

bar = np.reshape(foo, (240, 100, 2000))

This gives me roughly the shape I want: there are 240 different experimental conditions, each has 100 variants, and for each of these variants, I have 2000 Monte Carlo samples in my data set.
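
As a quick sanity check on the reshape above, the following sketch uses a small made-up array (3 conditions with 2 samples each and 4 draws, instead of 240, 100 and 2000) to confirm that a C-order reshape keeps each contiguous block of rows together as one condition. The names n_cond, n_sample and n_draw are illustrative and not part of the original post.

```python
import numpy as np

n_cond, n_sample, n_draw = 3, 2, 4  # hypothetical small sizes

# Rows are laid out as contiguous blocks of samples, one block per condition.
foo = np.arange(n_cond * n_sample * n_draw).reshape(n_cond * n_sample, n_draw)

# A C-order reshape splits the leading axis into (condition, sample).
bar = foo.reshape(n_cond, n_sample, n_draw)

# Element [c, s] of the reshaped array is row c * n_sample + s of the original.
for c in range(n_cond):
    for s in range(n_sample):
        assert np.array_equal(bar[c, s], foo[c * n_sample + s])
```

If the samples were interleaved rather than stored in contiguous blocks, this check would fail and the axes would have to be reshaped in the opposite order and then swapped.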

Now, I extract the information about the 240 experimental conditions from the Pandas DataFrame:

import pandas as pd
# qdf is the original dataframe with the experimental conditions and some
# extraneous information in other columns
new_df = qdf[['gate', 'input', 'output', 'media', 'od_lb', 'od_ub', 'temperature']]
idx = pd.MultiIndex.from_frame(new_df)

Finally, I reassembled a DataArray from the numpy array and the pandas MultiIndex:

xr.DataArray(bar, name='obs', dims=['regions', 'conditions', 'draws'],
             coords={'regions': idx, 'conditions': range(100), 'draws': range(2000)})

The resulting DataArray has these coordinates, as I wished:

Coordinates:
  * regions      (regions) MultiIndex
  - gate         (regions) object 'AND' 'AND' 'AND' 'AND' ... 'AND' 'AND' 'AND'
  - input        (regions) object '00' '10' '10' '10' ... '01' '01' '11' '11'
  - output       (regions) object '0' '0' '0' '0' '0' ... '0' '0' '0' '1' '1'
  - media        (regions) object 'standard_media' ... 'high_osm_media_five_percent'
  - od_lb        (regions) float64 0.0 0.001 0.001 ... 0.0001 0.0051 0.0051
  - od_ub        (regions) float64 0.0001 0.0051 0.0051 2.0 ... 0.0003 2.0 2.0
  - temperature  (regions) int64 30 30 37 30 37 30 37 ... 37 30 37 30 37 30 37
  * conditions   (conditions) int64 0 1 2 3 4 5 6 7 ... 92 93 94 95 96 97 98 99
  * draws        (draws) int64 0 1 2 3 4 5 6 ... 1994 1995 1996 1997 1998 1999

That was pretty horrible, though, and it seems wrong that I had to punch through all the nice layers of xarray abstraction to get to this point. Especially since this does not seem like an unusual piece of a scientific workflow: getting a relatively raw data set together with a spreadsheet of metadata that needs to be combined with the data. So what am I doing wrong? What's the more elegant solution?


Answer 1:


Given the starting Dataset, similar to:

<xarray.Dataset>
Dimensions:  (draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

You can chain several pure xarray methods to subdivide the dimension (keeping the data in the same shape but indexing it with a multiindex) or even to reshape the Dataset. To subdivide the dimension, the following code can be used:

multiindex_ds = ds.assign_coords(
    dim_0=["a", "b", "c"], dim_1=[0,1], dim_2=range(4)
).stack(
    dim=("dim_0", "dim_1", "dim_2")
).reset_index(
    "row", drop=True
).rename(
    row="dim"
)
multiindex_ds

whose output is:

<xarray.Dataset>
Dimensions:  (dim: 24, draw: 2)
Coordinates:
  * draw     (draw) int32 0 1
  * dim      (dim) MultiIndex
  - dim_0    (dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
  - dim_1    (dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
  - dim_2    (dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Data variables:
    obs      (draw, dim) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

Moreover, the multiindex can then be unstacked, effectively reshaping the Dataset:

reshaped_ds = multiindex_ds.unstack("dim")
reshaped_ds

with output:

<xarray.Dataset>
Dimensions:  (dim_0: 3, dim_1: 2, dim_2: 4, draw: 2)
Coordinates:
  * draw     (draw) int32 0 1
  * dim_0    (dim_0) object 'a' 'b' 'c'
  * dim_1    (dim_1) int64 0 1
  * dim_2    (dim_2) int64 0 1 2 3
Data variables:
    obs      (draw, dim_0, dim_1, dim_2) int32 0 1 2 3 4 5 ... 42 43 44 45 46 47

I think that this alone does not completely cover your needs because you want to convert a dimension into two dimensions, one of which is to be a multiindex. All the building blocks are here though.

For example, you can follow these steps (including unstacking) with regions and conditions, and then follow them again (without unstacking this time) to convert regions to a multiindex. Another option would be to use all dimensions from the start, unstack them, and then stack them again, leaving conditions outside of the final multiindex.
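
The first of these two options might look roughly like the sketch below. It is not the method used in this answer: it attaches pandas multiindexes directly instead of going through stack, and everything in it is made up for illustration (the tiny sizes, the meta frame, and the with_multiindex helper). The names regions and conditions follow the question, where regions are the experimental conditions and conditions are the samples within each of them. The helper uses xr.Coordinates.from_pandas_multiindex when the installed xarray provides it and falls back to plain assign_coords otherwise.

```python
import numpy as np
import pandas as pd
import xarray as xr


def with_multiindex(ds, index, dim):
    """Attach a pandas MultiIndex to dimension `dim` (hypothetical helper)."""
    coords_cls = getattr(xr, "Coordinates", None)
    if hasattr(coords_cls, "from_pandas_multiindex"):
        # Recent xarray: wrap the pandas index explicitly.
        return ds.assign_coords(coords_cls.from_pandas_multiindex(index, dim))
    # Older xarray: a pandas MultiIndex can be assigned directly.
    return ds.assign_coords(**{dim: index})


# Made-up stand-ins: 3 regions with 2 contiguous samples each, and 4 draws.
n_region, n_sample, n_draw = 3, 2, 4
data = np.arange(n_draw * n_region * n_sample).reshape(n_draw, n_region * n_sample)
ds = xr.Dataset({"obs": (("draw", "rows"), data)})
meta = pd.DataFrame(
    {
        "gate": ["AND", "AND", "AND"],
        "input": ["00", "01", "10"],
        "media": ["standard", "rich", "slow"],
    }
)

# Step 1: split rows into (regions, conditions) and unstack into real dimensions.
block_index = pd.MultiIndex.from_product(
    [range(n_region), range(n_sample)], names=["regions", "conditions"]
)
split = with_multiindex(ds, block_index, "rows").unstack("rows")

# Step 2: swap the integer regions labels for the metadata multiindex (no unstack).
region_index = pd.MultiIndex.from_frame(meta)
labelled = with_multiindex(split.drop_vars("regions"), region_index, "regions")

# Levels of the multiindex can now be used for selection.
print(labelled.obs.sel(input="01").values)
```

This relies on the metadata rows being in the same order as the blocks of samples, so with a real DataFrame it would need to be sorted by its block index first.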


Detailed answer

The answer combines several quite unrelated commands, and it might be tricky to see what each of them is doing.

assign_coords

The first step is to create new dimensions and coordinates and add them to the Dataset. This is necessary because the next methods need the dimensions and coordinates to already be present in the Dataset.

Stopping right after assign_coords yields the following Dataset:

<xarray.Dataset>
Dimensions:  (dim_0: 3, dim_1: 2, dim_2: 4, draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
  * dim_0    (dim_0) <U1 'a' 'b' 'c'
  * dim_1    (dim_1) int32 0 1
  * dim_2    (dim_2) int32 0 1 2 3
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

stack

The Dataset now contains 3 new dimensions whose sizes multiply to 24 elements (3 × 2 × 4). However, the data is currently flat with respect to these 24 elements, so we have to stack the new dimensions into a single 24-element multiindex to make the shapes compatible.

I find assign_coords followed by stack the most natural solution. However, another possibility would be to generate a multiindex with pandas, similarly to how it is done in the question, and call assign_coords directly with that multiindex, rendering the stack unnecessary.

This step combines all 3 new dimensions into a single one:

<xarray.Dataset>
Dimensions:  (dim: 24, draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
  * dim      (dim) MultiIndex
  - dim_0    (dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
  - dim_1    (dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
  - dim_2    (dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

Note that we now have 2 dimensions of size 24, as desired.
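
The alternative mentioned earlier in this section (building the multiindex with pandas and assigning it directly, so that no stack is needed) could be sketched as follows. This is an assumption-laden illustration rather than the code used in this answer: the toy Dataset is rebuilt here with row as a dimension without coordinates, the multiindex is kept on row instead of being renamed to dim, and xr.Coordinates.from_pandas_multiindex is only used when the installed xarray provides it.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy Dataset shaped like the one in this answer; row has no coordinate yet.
ds = xr.Dataset({"obs": (("draw", "row"), np.arange(48).reshape(2, 24))})

# Full product of the three new levels; the last level varies fastest.
idx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [0, 1], range(4)], names=["dim_0", "dim_1", "dim_2"]
)

coords_cls = getattr(xr, "Coordinates", None)
if hasattr(coords_cls, "from_pandas_multiindex"):
    # Recent xarray: wrap the pandas index explicitly.
    multiindex_ds = ds.assign_coords(coords_cls.from_pandas_multiindex(idx, "row"))
else:
    # Older xarray: a pandas MultiIndex can be assigned directly.
    multiindex_ds = ds.assign_coords(row=idx)

# Unstacking turns the three levels into real dimensions.
reshaped_ds = multiindex_ds.unstack("row")
print(reshaped_ds.obs.sel(dim_0="b", dim_1=1, dim_2=2).values)
```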

reset_index

Our final dimension is now present in the Dataset as a coordinate, and we want this new coordinate to be the one used to index the variable obs. set_index seems like the correct choice. However, each of our coordinates indexes itself (unlike the example in the set_index docs, where x indexes both the x and a coordinates), which means that set_index cannot be used in this particular case. The method to use is reset_index, which removes the coordinate row without removing the dimension row.

The following output shows that row is now a dimension without coordinates:

<xarray.Dataset>
Dimensions:  (dim: 24, draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * dim      (dim) MultiIndex
  - dim_0    (dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
  - dim_1    (dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
  - dim_2    (dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Dimensions without coordinates: row
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

rename

The current Dataset is nearly the final one; the only issue is that the obs variable still has the row dimension instead of the desired one, dim. This does not really look like an intended usage of rename, but it can be used to get dim to absorb row, yielding the desired final result (called multiindex_ds above).

Here again, set_index seems to be the method to choose. However, if set_index(row="dim") is used instead of rename(row="dim"), the multiindex is collapsed into an index made of tuples:

<xarray.Dataset>
Dimensions:  (draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) object ('a', 0, 0) ('a', 0, 1) ... ('c', 1, 2) ('c', 1, 3)
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47


Source: https://stackoverflow.com/questions/59504320/how-do-i-subdivide-refine-a-dimension-in-an-xarray-dataset
