Rolling idxmin/max for pandas DataFrame

随声附和 submitted on 2021-01-24 06:56:32

Question


I believe the following function is a working solution for pandas DataFrame rolling argmin/max:

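import numpy as np
import pandas as pd

def data_frame_rolling_arg_func(df, window_size, func):
    # func is 'min' or 'max'; it selects np.argmin or np.argmax.
    ws = window_size
    wm1 = window_size - 1
    # Window-relative positions are shifted by each window's first row number,
    # mapped to index labels, then padded with NaN for the first ws - 1 rows.
    # Note: DataFrame.applymap is deprecated since pandas 2.1 (use DataFrame.map).
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'))[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))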

It is inspired by a partial solution for rolling idxmax on pandas Series.

Explanations:

  • Apply the numpy argmin/max function to the rolling window.
  • Only keep the non-NaN values.
  • Convert the values to int.
  • Realign the values to original row numbers.
  • Use applymap to replace the row numbers by the index values.
  • Combine with the original DataFrame filled with NaN in order to add the first rows with expected NaN values.
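
The steps above can be traced on a single column. This is only a sketch: the Series `s`, its values, and the intermediate names (`rel`, `abs_pos`, `result`) are made up for illustration, and `raw=True` is an assumption added here so that `np.argmax` receives a plain NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only to make the arithmetic easy to follow.
s = pd.Series([4.0, 9.0, 2.0, 7.0, 1.0], index=list("abcde"))
ws = 3

# Step 1: position of the maximum *inside* each window
# (NaN for the first ws - 1 incomplete windows).
rel = s.rolling(ws).apply(np.argmax, raw=True)

# Steps 2 and 3: keep only complete windows and convert to int.
rel = rel.iloc[ws - 1:].astype(int)

# Step 4: the k-th complete window starts at row k, so adding k turns a
# window-relative position into an absolute row number.
abs_pos = rel.to_numpy() + np.arange(len(s) - (ws - 1))

# Step 5: replace row numbers by index labels.
labels = s.index[abs_pos].to_numpy()

# Step 6: put back the leading rows, which become NaN.
result = pd.Series(labels, index=rel.index).reindex(s.index)
print(result)
```

The windows [4, 9, 2], [9, 2, 7] and [2, 7, 1] have their maxima at window positions 1, 0 and 1. Adding the window start rows 0, 1 and 2 gives rows 1, 1 and 3, that is, labels b, b and d.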

In [1]: index = map(chr, range(ord('a'), ord('a') + 10))

In [2]: df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)

In [3]: df
Out[3]:
    0   1   2
a  -4  15   0
b   0  -6   4
c   7   8 -18
d  11  12 -16
e   6   3  -6
f  -1   4  -9
g   6 -10  -7
h   8  11 -25
i  -2 -10  -8
j   0  10  -7

In [4]: data_frame_rolling_arg_func(df, 3, 'max')
Out[4]:
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    a    b
d    d    d    b
e    d    d    e
f    d    d    e
g    e    f    e
h    h    h    g
i    h    h    g
j    h    h    j

In [5]: data_frame_rolling_arg_func(df, 3, 'min')
Out[5]:
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    b    c
d    b    b    c
e    e    e    c
f    f    e    d
g    f    g    f
h    f    g    h
i    i    g    h
j    i    i    h

My questions are:

  • Can you find any mistakes?
  • Is there a better solution? That is: more performant and/or more elegant.

And for pandas maintainers out there: it would be nice if the already great pandas library included rolling idxmax and idxmin.


Answer 1:


The NaN issue I mentioned in a comment to the OP can be solved in the following manner:

import numpy as np
import pandas as pd


def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    return (df.rolling(ws, min_periods=0).apply(getattr(np, f'arg{func}'),
                                                raw=True)[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))


def main():
    index = map(chr, range(ord('a'), ord('a') + 10))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
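    df[0] = df[0].astype(float)  # column 0 must be float to hold NaN
    df.iloc[3:6, 0] = np.nan     # single assignment instead of chained df[0][3:6] = ...
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))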


if __name__ == "__main__":
    main()

Result:

$ python demo.py 
      0   1   2
a   3.0   0   7
b   1.0   3  11
c   1.0  15  -6
d   NaN   2 -16
e   NaN   0  24
f   NaN   0  14
g   2.0   0   4
h  -1.0 -11  16
i  17.0   0  -2
j   3.0  -5  -8
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    b    a    c
d    d    d    d
e    d    e    d
f    d    e    d
g    e    e    g
h    f    h    g
i    h    h    i
j    h    h    j
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    c    b
d    d    c    b
e    d    c    e
f    d    d    e
g    e    e    e
h    f    f    h
i    i    g    h
j    i    i    h

The handling of NaN values is a little subtle. I want my rolling idxmin/max function to cooperate well with the regular DataFrame rolling min/max functions. By default, these generate a NaN value as soon as the input window contains a NaN value, and so does the rolling apply function. For apply, that is a problem, because I cannot transform a NaN result into an index value. It is also a pity: the NaN values in the output appear because NaN values are present in the input, so the index of the NaN input value is what I would like my rolling idxmin/max function to produce. Fortunately, this is exactly what I get with the following combination of parameters:

  • min_periods=0 for the pandas rolling function. The apply function will then get a chance to produce its own value regardless of how many NaN values are found in the input window.
  • raw=True for the apply function. This parameter ensures that the input to the applied function is passed as a numpy array instead of a pandas Series. np.argmin/max will then return the index of the first input NaN value, which is exactly what we want. It should be noted that without raw=True, i.e. in the pandas Series case, np.argmin/max seems to ignore the NaN values, which is NOT what we want. The nice thing with raw=True is that it should improve performance too! More about that later.
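
The effect of these two parameters can be checked on a small example. This is a sketch with made-up values; the names `window`, `s`, `rel` and `default` are hypothetical, and `np.nanargmin` is shown only for contrast since it is not part of the solution above.

```python
import numpy as np
import pandas as pd

# On a plain NumPy array, argmin/argmax return the position of the first NaN.
window = np.array([3.0, np.nan, 1.0, np.nan])
first_nan = int(np.argmin(window))    # position of the first NaN
ignoring = int(np.nanargmin(window))  # NaN-ignoring variant, for contrast
print(first_nan, ignoring)

s = pd.Series([3.0, 1.0, np.nan, 2.0, 5.0])

# min_periods=0 lets the applied function run on every window, and raw=True
# hands it a NumPy array, so windows containing NaN still yield a position.
rel = s.rolling(3, min_periods=0).apply(np.argmin, raw=True)
print(rel.tolist())

# With the defaults, a window with fewer than 3 valid values gives NaN.
default = s.rolling(3).apply(np.argmin, raw=True)
print(default.tolist())
```

In this example every window of `s` is either incomplete or contains the NaN, so the default call returns only NaN, while the `min_periods=0` call returns a usable window position for each row.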



Answer 2:


The solution in my previous answer manages to give proper index values for NaN input values, but I have realized that this is most probably not what a native pandas rolling idxmin/idxmax would do by default. By default, it would produce a NaN value if the window contains one or more NaN values.
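
For reference, this is how the existing rolling min behaves by default, which is the behavior a rolling idxmin/idxmax would presumably mirror. The Series `s` and its values are made up for this sketch.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0])

# Default: min_periods equals the window size, so any window that is
# incomplete or contains a NaN produces NaN.
default_min = s.rolling(3).min()
print(default_min.tolist())

# For contrast: with min_periods=1 the NaN is skipped instead.
lenient_min = s.rolling(3, min_periods=1).min()
print(lenient_min.tolist())
```

Only the last window, [3, 4, 5], is complete and NaN-free, so it is the only row where the default call returns a number.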

I came up with a variant of my solution, which does that:

import numpy as np
import pandas as pd


def transform_if_possible(func):
    def f(i):
        try:
            return func(i)
        except ValueError:
            return i
    return f


int_if_possible = transform_if_possible(int)


def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1

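    # int() raises ValueError on NaN, so NaN cells pass through unchanged;
    # it also avoids indexing df.index with a float row number.
    index_if_possible = transform_if_possible(lambda i: df.index[int(i)])

    return (df.rolling(ws).apply(getattr(np, f'arg{func}'), raw=True).applymap(int_if_possible) +
            np.array([np.arange(len(df)) - wm1]).T).applymap(index_if_possible)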


def main():
    print(int_if_possible(1.2))
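    print(int_if_possible(np.nan))
    index = map(chr, range(ord('a'), ord('a') + 10))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    df[0] = df[0].astype(float)  # column 0 must be float to hold NaN
    df.iloc[3:6, 0] = np.nan     # single assignment instead of chained df[0][3:6] = ...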
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))


if __name__ == "__main__":
    main()

Results:

1
nan
      0   1   2
a  15.0  -2  13
b  -6.0  -4  -3
c -12.0  -7  -8
d   NaN   0  -4
e   NaN  -1 -11
f   NaN  -9  10
g  -1.0  24   1
h -15.0  14 -16
i   7.0  -4  14
j  -1.0   4  10
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    c    c
d  NaN    c    c
e  NaN    c    e
f  NaN    f    e
g  NaN    f    e
h  NaN    f    h
i    h    i    h
j    h    i    h
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    a    a
d  NaN    d    b
e  NaN    d    d
f  NaN    d    f
g  NaN    g    f
h  NaN    g    f
i    i    g    i
j    i    h    i

To achieve my goal, I am using two functions to transform values into integers, and row numbers into index values, respectively, which leave NaN unchanged. I construct these functions with the help of a common closure, transform_if_possible. In the second case, since the index transformation is dependent on the DataFrame, I construct the transformation function from a local lambda function.
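
The closure idea can be seen in isolation without pandas. In this sketch the list `labels` stands in for `df.index`, and the names `wrapper`, `label_if_possible` and `converted` are hypothetical. The explicit `int(pos)` cast inside the lookup is an assumption of the sketch: it guarantees that a NaN position raises the `ValueError` that the wrapper catches.

```python
import math


def transform_if_possible(func):
    """Wrap func so that inputs it cannot convert are returned unchanged."""
    def wrapper(value):
        try:
            return func(value)
        except ValueError:
            return value
    return wrapper


labels = ["a", "b", "c"]  # stand-in for df.index

int_if_possible = transform_if_possible(int)
# The lambda closes over `labels`, just as the local lambda closes over df.
label_if_possible = transform_if_possible(lambda pos: labels[int(pos)])

converted = [label_if_possible(x) for x in [2.0, math.nan, 0.0]]
print(int_if_possible(1.2), converted)
```

Valid positions become labels, while the NaN entry comes back untouched, which is what lets the NaN cells survive both transformation passes.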

Apart from these aspects, the solution is similar to my previous one, but since NaN is explicitly handled, I no longer need special handling of the first window_size - 1 rows, so the code is a little shorter.

A nice side effect of this solution is that the running time seems to be lower: a little over three times the running time of the corresponding rolling min/max, instead of five times.
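
A comparison of this kind could be set up roughly as follows. This is a hedged sketch, not the measurement behind the figures above: the data size, the repeat counts and the helper names `rolling_max` and `rolling_argmax` are arbitrary assumptions, and only the rolling argmax step is timed, not the full label lookup. Actual ratios depend on the machine, the data and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((500, 3)))


def rolling_max():
    # Baseline: the built-in rolling maximum.
    return df.rolling(3).max()


def rolling_argmax():
    # Core of the idxmax approach: np.argmax applied to each raw window.
    return df.rolling(3).apply(np.argmax, raw=True)


# Take the best of a few repeats to reduce noise.
t_max = min(timeit.repeat(rolling_max, number=5, repeat=3))
t_arg = min(timeit.repeat(rolling_argmax, number=5, repeat=3))
print(f"rolling max: {t_max:.4f}s  rolling argmax: {t_arg:.4f}s")
```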

All in all, a better solution I think.



Source: https://stackoverflow.com/questions/65526535/rolling-idxmin-max-for-pandas-dataframe
