Rolling idxmin/max for pandas DataFrame

Question

I believe the following function is a working solution for pandas DataFrame rolling argmin/max:

import numpy as np
import pandas as pd

def data_frame_rolling_arg_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'))[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))

It is inspired by a partial solution for rolling idxmax on a pandas Series.

Explanations:

1. Apply the numpy argmin/max function to the rolling window.
2. Only keep the non-NaN values.
3. Convert the values to int.
4. Realign the values to the original row numbers.
5. Use applymap to replace the row numbers by the index values.
6. Combine with the original DataFrame filled with NaN in order to add the first rows with the expected NaN values.
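The steps above can be traced one at a time on a single column. The following is a minimal sketch, assuming an input without NaN values; the helper name `rolling_idxmax_steps` and the sample Series `s` are illustrative and not part of the original function.

```python
import numpy as np
import pandas as pd


def rolling_idxmax_steps(s, window_size):
    """Illustrative step-by-step rolling idxmax for one Series (no NaN input)."""
    wm1 = window_size - 1
    # Step 1: position of the maximum inside each window (0 .. window_size - 1).
    rel = s.rolling(window_size).apply(np.argmax, raw=True)
    # Steps 2-3: drop the incomplete leading windows, then convert to int.
    rel = rel.iloc[wm1:].astype(int)
    # Step 4: the k-th complete window starts at row k, so adding k
    # turns window-relative positions into absolute row numbers.
    pos = rel.to_numpy() + np.arange(len(s) - wm1)
    # Step 5: translate row numbers into index labels.
    labels = pd.Series(s.index[pos], index=rel.index, dtype=object)
    # Step 6: reindex so the leading rows come back as NaN.
    return labels.reindex(s.index)


s = pd.Series([1, 5, 2, 0, 7], index=list("abcde"))
result = rolling_idxmax_steps(s, 3)
print(result)
```

For the window ending at row c the values are 1, 5, 2, so the relative position is 1 and the offset is 0, which gives row b. For the window ending at row e the values are 2, 0, 7, so the relative position is 2 and the offset is 2, which gives row e.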


In [1]: index = map(chr, range(ord('a'), ord('a') + 10))

In [2]: df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)

In [3]: df
Out[3]: 
    0   1   2
a  -4  15   0
b   0  -6   4
c   7   8 -18
d  11  12 -16
e   6   3  -6
f  -1   4  -9
g   6 -10  -7
h   8  11 -25
i  -2 -10  -8
j   0  10  -7

In [4]: data_frame_rolling_arg_func(df, 3, 'max')
Out[4]: 
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    a    b
d    d    d    b
e    d    d    e
f    d    d    e
g    e    f    e
h    h    h    g
i    h    h    g
j    h    h    j

In [5]: data_frame_rolling_arg_func(df, 3, 'min')
Out[5]: 
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    b    c
d    b    b    c
e    e    e    c
f    f    e    d
g    f    g    f
h    f    g    h
i    i    g    h
j    i    i    h

My questions are:

Can you find any mistakes?
Is there a better solution? That is: more performant and/or more elegant.

And for pandas maintainers out there: it would be nice if the already great pandas library included rolling idxmax and idxmin.

Answer 1:

The NaN issue I mentioned in a comment to the OP can be solved in the following manner:

import numpy as np
import pandas as pd


def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    return (df.rolling(ws, min_periods=0).apply(getattr(np, f'arg{func}'),
                                                raw=True)[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))


def main():
    index = map(chr, range(ord('a'), ord('a') + 10))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    df[0] = df[0].astype(float)  # NaN needs a float column
    df.iloc[3:6, 0] = np.nan     # rows d, e, f of column 0
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))


if __name__ == "__main__":
    main()

Result:

$ python demo.py 
      0   1   2
a   3.0   0   7
b   1.0   3  11
c   1.0  15  -6
d   NaN   2 -16
e   NaN   0  24
f   NaN   0  14
g   2.0   0   4
h  -1.0 -11  16
i  17.0   0  -2
j   3.0  -5  -8
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    b    a    c
d    d    d    d
e    d    e    d
f    d    e    d
g    e    e    g
h    f    h    g
i    h    h    i
j    h    h    j
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    c    b
d    d    c    b
e    d    c    e
f    d    d    e
g    e    e    e
h    f    f    h
i    i    g    h
j    i    i    h

The handling of NaN values is a little subtle. I want my rolling idxmin/max function to cooperate well with the regular DataFrame rolling min/max functions. By default, these produce NaN as soon as the window contains a NaN value, and so does the rolling apply function. For apply, that is a problem, because a NaN result cannot be turned into an index value. This is a pity: the NaN values in the output appear only because there are NaN values in the input, so the index of the NaN input value is what I would like my rolling idxmin/max function to produce. Fortunately, this is exactly what I get with the following combination of parameters:

min_periods=0 for the pandas rolling function. The apply function will then get a chance to produce its own value regardless of how many NaN values are found in the input window.
raw=True for the apply function. This parameter ensures that the input to the applied function is passed as a numpy array instead of a pandas Series. np.argmin/max will then return the index of the first input NaN value, which is exactly what we want. It should be noted that without raw=True, i.e. in the pandas Series case, np.argmin/max seems to ignore the NaN values, which is NOT what we want. The nice thing with raw=True is that it should improve performance too! More about that later.
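The effect of these two parameters can be checked on a tiny input. This is a minimal sketch with made-up sample values; it only uses `np.argmin`, `np.nanargmin` and `Series.rolling(...).apply(..., raw=True)`, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# On a plain numpy array, argmin lands on the first NaN,
# while nanargmin skips NaN values.
window = np.array([1.0, np.nan, 0.5])
first_nan = int(np.argmin(window))        # 1, the NaN position
ignoring_nan = int(np.nanargmin(window))  # 2, the position of 0.5

s = pd.Series([3.0, 1.0, np.nan, 2.0])

# Default min_periods (equal to the window size): every window here is
# either incomplete or contains a NaN, so the function is never called.
default_apply = s.rolling(3).apply(np.argmin, raw=True)

# min_periods=0: the function is called on every window, NaN included.
relaxed_apply = s.rolling(3, min_periods=0).apply(np.argmin, raw=True)

print(first_nan, ignoring_nan)
print(default_apply.tolist())
print(relaxed_apply.tolist())
```

Note that with `min_periods=0` the first `window_size - 1` windows are shorter than the others, which is why the function above still slices them off with `[wm1:]` before realigning.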

Answer 2:

The solution in my previous answer manages to give proper index values for NaN input values, but I have realized that this is most probably not what a native pandas rolling idxmin/idxmax would do by default. By default, it would produce a NaN value if there is one or more NaN values in the window.
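For reference, this is how the built-in rolling min behaves with its default settings. It is a small sketch on made-up values; only the last window is free of NaN values, so only the last row gets a result.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, np.nan, 2.0, 5.0, 4.0])

# With the default min_periods, a window containing any NaN yields NaN,
# just like the incomplete windows at the start.
rolled_min = s.rolling(3).min()
print(rolled_min.tolist())
```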

I came up with a variant of my solution, which does that:

import numpy as np
import pandas as pd


def transform_if_possible(func):
    def f(i):
        try:
            return func(i)
        except ValueError:
            return i
    return f


int_if_possible = transform_if_possible(int)


def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1

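    # int(i) raises ValueError for NaN, so NaN is passed through unchanged;
    # it also avoids indexing the Index with a float position.
    index_if_possible = transform_if_possible(lambda i: df.index[int(i)])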

    return (df.rolling(ws).apply(getattr(np, f'arg{func}'), raw=True).applymap(int_if_possible) +
            np.array([np.arange(len(df)) - wm1]).T).applymap(index_if_possible)


def main():
    print(int_if_possible(1.2))
    print(int_if_possible(np.nan))
    index = map(chr, range(ord('a'), ord('a') + 10))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    df[0] = df[0].astype(float)  # NaN needs a float column
    df.iloc[3:6, 0] = np.nan     # rows d, e, f of column 0
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))


if __name__ == "__main__":
    main()

Results:

1
nan
      0   1   2
a  15.0  -2  13
b  -6.0  -4  -3
c -12.0  -7  -8
d   NaN   0  -4
e   NaN  -1 -11
f   NaN  -9  10
g  -1.0  24   1
h -15.0  14 -16
i   7.0  -4  14
j  -1.0   4  10
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    c    c
d  NaN    c    c
e  NaN    c    e
f  NaN    f    e
g  NaN    f    e
h  NaN    f    h
i    h    i    h
j    h    i    h
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    a    a
d  NaN    d    b
e  NaN    d    d
f  NaN    d    f
g  NaN    g    f
h  NaN    g    f
i    i    g    i
j    i    h    i

To achieve my goal, I am using two functions to transform values into integers, and row numbers into index values, respectively, which leave NaN unchanged. I construct these functions with the help of a common closure, transform_if_possible. In the second case, since the index transformation is dependent on the DataFrame, I construct the transformation function from a local lambda function.

Apart from these aspects, the solution is similar to my previous one, but since NaN is explicitly handled, I no longer need special handling of the first window_size - 1 rows, so the code is a little shorter.

A nice side effect of this solution is that the running time seems to be lower: a little over three times the running time of the corresponding rolling min/max, instead of five times.

All in all, a better solution I think.
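As a possible further variant, the same NaN-propagating behaviour can be sketched without rolling apply, using NumPy's `sliding_window_view` (available in NumPy 1.20 and later). This is a different technique from the one above, not a pandas API, and it has not been benchmarked here; the function name `rolling_idx_sketch` and the sample data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_idx_sketch(df, window_size, func):
    """Rolling idxmin/idxmax sketch; func is 'min' or 'max'.

    Windows that are incomplete or contain NaN produce NaN.
    """
    values = df.to_numpy(dtype=float)
    result = np.full(values.shape, np.nan, dtype=object)
    if len(df) < window_size:
        return pd.DataFrame(result, index=df.index, columns=df.columns)
    # Shape: (n_rows - window_size + 1, n_cols, window_size).
    windows = sliding_window_view(values, window_size, axis=0)
    # Position of the extremum inside each window.
    rel = getattr(np, f"arg{func}")(windows, axis=-1)
    # The k-th window starts at row k: add k to get absolute row numbers.
    pos = rel + np.arange(windows.shape[0])[:, None]
    labels = df.index.to_numpy()[pos].astype(object)
    # Propagate NaN for every window that contains at least one NaN.
    labels[np.isnan(windows).any(axis=-1)] = np.nan
    result[window_size - 1:] = labels
    return pd.DataFrame(result, index=df.index, columns=df.columns)


demo = pd.DataFrame(
    {"x": [1, 5, 2, 0, 7], "y": [4.0, np.nan, 1.0, 3.0, 2.0]},
    index=list("abcde"),
)
res = rolling_idx_sketch(demo, 3, "max")
print(res)
```

In column y only the last window (1, 3, 2) is free of NaN values, so only row e gets a label there.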

Source: https://stackoverflow.com/questions/65526535/rolling-idxmin-max-for-pandas-dataframe

Tags: python, pandas, dataframe, indexing, rolling-computation