Reorganizing an MxN 2D array of datapoints into an N-dimensional array

问题

I've got a series of measurements in a 2D array such as

T    mu1  mu2  mu3  a    b    c    d    e
0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  2.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  0.0  3.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  1.0  2.0  0.0  0.0  0.0  0.0  0.0
0.0  0.0  1.0  3.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  2.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  2.0  1.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  2.0  2.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  2.0  3.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  3.0  0.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  3.0  1.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  3.0  2.0  0.0  0.0  0.0  0.0  0.0
0.0  1.0  3.0  3.0  0.0  0.0  0.0  0.0  0.0
1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
1.0  0.0  0.0  2.0  0.0  0.0  0.0  0.0  0.0
1.0  0.0  0.0  3.0  0.0  0.0  0.0  0.0  0.0
1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
1.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0
1.0  0.0  1.0  2.0  0.0  0.0  0.0  0.0  0.0
1.0  0.0  1.0  3.0  0.0  0.0  0.0  0.0  0.0
1.0  1.0  2.0  0.0  0.0  0.0  0.0  0.0  0.0
1.0  1.0  2.0  1.0  0.0  0.0  0.0  0.0  0.0
1.0  1.0  2.0  2.0  0.0  0.0  0.0  0.0  0.0
1.0  1.0  2.0  3.0  0.0  0.0  0.0  0.0  0.0
1.0  1.0  3.0  0.0  0.0  0.0  0.0  0.0  0.0
1.0  1.0  3.0  1.0  0.0  0.0  0.0  0.0  0.0
1.0  1.0  3.0  2.0  0.0  0.0  0.0  0.0  0.0
1.0  1.0  3.0  3.0  0.0  0.0  0.0  0.0  0.0

where T, mu1, mu2 and mu3 are the 4 axes of the variables I control (independent variables). a, b, c, d and e are the measurements I've made (dependent variables).

I would like to convert this 2D array into a 5D array in numpy. By specifying T, mu1, mu2 and mu3 (or at least their 4 indexes) I want to be able to retrieve the corresponding a, b, c, d and e values.

Is there a straightforward way to reshape this kind of array by specifying what columns the axes correspond to? The MultiIndex in Pandas seemed to smartly organize it in a table, but seems ill-suited for high dimensional arrays. I won't necessarily know ahead of time what the shape of the ndarray should be, but it seems to me that based on the values it should be possible to reshape the array properly. The increment values for each axis might also be different, but they will always be uniform.

My current idea involves ignoring the mu1, mu2 and mu3 columns, and stacking sets of T data into a 3D array. From there I would stack sets of 3D mu1 data into a 4D array, and repeat the process with mu2 and mu3. This seems like a tedious process that should have a simple solution though.

回答1:

First, let's make some fake data:

# an N x 5 array containing a regular mesh representing the stimulus params
stim_params = np.mgrid[:2, :3, :4, :5, :6].reshape(5, -1).T

# an N x 3 array representing the output values for each simulation run
output_vals = np.arange(720 * 3).reshape(720, 3)

# shuffle the rows for a bit of added realism
shuf = np.random.permutation(stim_params.shape[0])
stim_params = stim_params[shuf]
output_vals = output_vals[shuf]

Now you can use np.lexsort to get the set of indices that will sort the rows of your 2D array of simulation parameters such that the values in each column are in ascending order. Having done that, you can apply these indices to the rows of simulation output values.

# get the number of unique values for each stimulus parameter
params_shape = tuple(np.unique(col).shape[0] for col in stim_params.T)

# get the set of row indices that will sort the stimulus parameters in ascending
# order, starting with the final column
idx = np.lexsort(stim_params[:, ::-1].T)

# sort and reshape the stimulus parameters:
sorted_params = stim_params[idx].T.reshape((5,) + params_shape)

# sort and reshape the output values
sorted_output = output_vals[idx].T.reshape((3,) + params_shape)

I find that the hardest part is often just trying to wrap your head around what all the different dimensions of the outputs correspond to:

# array of stimulus parameters, with dimensions (n_params, p1, p2, p3, p4, p5)
print(sorted_params.shape)
# (5, 2, 3, 4, 5, 6)

# to check that the sorting worked as expected, we can look at the values of the 
# 5th parameter when all the others are held constant at 0:
print(sorted_params[4, 0, 0, 0, 0, :])
# [0 1 2 3 4 5]

# ... and the 1st parameter when we hold all the others constant:
print(sorted_params[0, :, 0, 0, 0, 0])
# [0, 1]

# ... now let the 1st and 2nd parameters covary:
print(sorted_params[:2, :, :, 0, 0, 0])
# [[[0 0 0]
#   [1 1 1]]

#  [[0 1 2]
#   [0 1 2]]]

Hopefully you get the idea. The same indexing logic applies to the sorted simulation outputs:

# array of outputs, with dimensions (n_outputs, p1, p2, p3, p4, p5)
print(sorted_output.shape)
# (3, 2, 3, 4, 5, 6)

# the first output variable whilst holding the first 4 simulation parameters
# constant at 0:
print(sorted_output[0, 0, 0, 0, 0, :])
# [ 0  3  6  9 12 15]

来源：https://stackoverflow.com/questions/32129572/reorganizing-an-mxn-2d-array-of-datapoints-into-an-n-dimensional-array

标签

python

numpy

multidimensional-array