Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array

Question

Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a. an n-tensor)?

Example

Suppose I set up a DataFrame like

from pandas import DataFrame, MultiIndex

index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)

which prints the five remaining rows; the row with index (1, 0) is missing.

The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using

print(frame.unstack().values)

which outputs

[[  0.   1.   2.]
 [ nan   4.   5.]]

How does this generalize to an n-level index?

From playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, not to add an axis.

I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.

Any suggestions are highly appreciated.

Answer 1:

Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.

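import numpy as np

# create an empty array of NaN of the right dimensions
# (np.full needs a tuple; in Python 3, map() returns an iterator)
shape = tuple(map(len, frame.index.levels))
arr = np.full(shape, np.nan)

# fill it using Numpy's advanced indexing
# (MultiIndex.codes was named MultiIndex.labels before pandas 0.24)
arr[tuple(frame.index.codes)] = frame.values.flat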
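
As a self-contained sketch of the indexing approach above, the two steps can be wrapped in a small function and checked against the 3-D example used later in this answer. The name multiindex_to_ndarray and its fill_value parameter are hypothetical, chosen only for illustration. The sketch assumes pandas 0.24 or later, where the integer positions are exposed as MultiIndex.codes (older releases call the same attribute labels), and it assumes the index contains no missing values.

```python
import numpy as np
import pandas as pd


def multiindex_to_ndarray(series, fill_value=np.nan):
    """Scatter a MultiIndex-ed Series into a dense n-D float array.

    Hypothetical helper for illustration; not part of pandas.
    """
    index = series.index
    # One axis per index level, sized by the number of labels on that level.
    shape = tuple(len(level) for level in index.levels)
    arr = np.full(shape, fill_value, dtype=float)
    # index.codes holds, per level, the integer position of each row's label.
    # Passing them as a tuple triggers NumPy's advanced (integer) indexing.
    arr[tuple(index.codes)] = series.to_numpy()
    return arr


# Same 3-D setup as the example in this answer, with row (1, 0, 1) removed.
index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": range(8)}, index=index).drop(index=[(1, 0, 1)])

tensor = multiindex_to_ndarray(frame["value"])
print(tensor.shape)
print(tensor)
```

Only the positions that actually occur in the index are written, so the missing combination (1, 0, 1) keeps the fill value.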

Original solution. Given a setup similar to above, but in 3-D,

from pandas import DataFrame, MultiIndex
from itertools import product

index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)

we have

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7

Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.

First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and consume a lot of memory, depending on the number of dimensions and on the size of the data frame.

levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)

which outputs

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7

Now, reshape() will work as intended.

# reshape() needs a tuple; in Python 3, map() returns an iterator
shape = tuple(map(len, frame.index.levels))
print(frame.values.reshape(shape))

which outputs

[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]

The (rather ugly) one-liner is

frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
     .reshape(tuple(map(len, frame.index.levels)))
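
A possibly more readable variant of the reindex route, given here only as a sketch: instead of building the cartesian product by hand with itertools.product, it passes the index levels to MultiIndex.from_product. The variable names full_index and dense are illustrative, and the example works on the single 'value' column so that the result has exactly one entry per index combination.

```python
import numpy as np
import pandas as pd

# Same 3-D setup as above, with row (1, 0, 1) removed.
index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": range(8)}, index=index).drop(index=[(1, 0, 1)])

# Full cartesian product of the labels on every level.
full_index = pd.MultiIndex.from_product(frame.index.levels)
shape = tuple(len(level) for level in frame.index.levels)

# Reindexing inserts NaN for every combination absent from the frame,
# so the row count matches the product of the level sizes and reshape() is safe.
dense = frame["value"].reindex(full_index).to_numpy().reshape(shape)
print(dense)
```

Like the original version, this materializes the full product of all levels, so the same caveats about time and memory apply.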

Source: https://stackoverflow.com/questions/35047882/transform-pandas-dataframe-with-n-level-hierarchical-index-into-n-d-numpy-array

Tags

python

pandas

multidimensional-array

multi-index