pandas - cumulative median

Submitted by 痞子三分冷 on 2019-12-02 01:19:27

You can use expanding.median -

df.a.expanding().median()

1    5.0
2    6.0
3    6.0
4    5.5
Name: a, dtype: float64
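For reference, here is a minimal self-contained sketch of the call above. The sample frame (values 5, 7, 6, 4 with index 1 to 4) is an assumption reconstructed from the outputs shown in this thread, and the `manual` list is only a cross-check, not part of the answer.

```python
import pandas as pd

# Sample frame assumed from the outputs in this thread: values 5, 7, 6, 4 with index 1..4.
df = pd.DataFrame({"a": [5, 7, 6, 4]}, index=[1, 2, 3, 4])

# expanding() grows the window from the first row up to the current row,
# so median() returns the median of all values seen so far.
cumulative_median = df["a"].expanding().median()

# Cross-check: compute the median of each prefix directly.
manual = [df["a"].iloc[: i + 1].median() for i in range(len(df))]

print(cumulative_median)
```

The result keeps the original index, so it can be assigned straight back as a new column.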

Timings

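import numpy as np
import pandas as pd

# cummedian is the bisect-based closure defined later in this thread
df = pd.DataFrame({'a' : np.arange(1000000)})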

%timeit df['a'].apply(cummedian())
1 loop, best of 3: 1.69 s per loop

%timeit df.a.expanding().median()
1 loop, best of 3: 838 ms per loop

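The winner is expanding.median, which runs about twice as fast as the bisect-based cummedian on one million rows (838 ms vs. 1.69 s). Divakar's method (the strided np.nanmedian approach below) is memory intensive and runs out of memory at this input size.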

We could create NaN-filled subarrays as rows with a strides-based function, like so -

import numpy as np

def nan_concat_sliding_windows(x):
    n = len(x)
    # Prepend n-1 NaNs so that row i holds x[:i+1], left-padded with NaN
    add_arr = np.full(n-1, np.nan)
    x_ext = np.concatenate((add_arr, x))
    strided = np.lib.stride_tricks.as_strided
    nrows = len(x_ext)-n+1
    s = x_ext.strides[0]
    # Overlapping n-wide windows as a view; each row starts one element later
    return strided(x_ext, shape=(nrows,n), strides=(s,s))

Sample run -

In [56]: x
Out[56]: array([5, 6, 7, 4])

In [57]: nan_concat_sliding_windows(x)
Out[57]: 
array([[ nan,  nan,  nan,   5.],
       [ nan,  nan,   5.,   6.],
       [ nan,   5.,   6.,   7.],
       [  5.,   6.,   7.,   4.]])

Thus, to get the cumulative median values for an array x, we can take np.nanmedian along each row, which ignores the NaN padding and gives a vectorized solution, like so -

np.nanmedian(nan_concat_sliding_windows(x), axis=1)
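As a hedged alternative sketch, the same NaN-padded window matrix can be built with `sliding_window_view`, which checks bounds and returns a read-only view, instead of raw `as_strided`. This assumes NumPy 1.20 or newer and a non-empty input. The name `nan_padded_windows` is made up for illustration and is not from the original answer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def nan_padded_windows(x):
    """Hypothetical helper: row i holds x[:i+1], left-padded with NaN."""
    x = np.asarray(x, dtype=float)
    n = len(x)  # assumed to be at least 1
    # Prepend n-1 NaNs, then take every length-n window of the padded array.
    padded = np.concatenate((np.full(n - 1, np.nan), x))
    return sliding_window_view(padded, n)

x = np.array([5, 6, 7, 4])
windows = nan_padded_windows(x)

# nanmedian skips the NaN padding, so each row reduces to a cumulative median.
medians = np.nanmedian(windows, axis=1)
print(medians)
```

The window matrix is still logically n x n, so this variant shares the memory concerns of the strided version on large inputs.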

Hence, the final solution would be -

In [54]: df
Out[54]: 
   a
1  5
2  7
3  6
4  4

In [55]: pd.Series(np.nanmedian(nan_concat_sliding_windows(df.a.values), axis=1))
Out[55]: 
0    5.0
1    6.0
2    6.0
3    5.5
dtype: float64

A faster solution specifically for the cumulative median (timed here on a 20-row frame)

In [1]: import timeit

In [2]: setup = """import bisect
   ...: import pandas as pd
   ...: def cummedian():
   ...:     l = []
   ...:     info = [0, True]
   ...:     def inner(n):
   ...:         bisect.insort(l, n)
   ...:         info[0] += 1
   ...:         info[1] = not info[1]
   ...:         median = info[0] // 2
   ...:         if info[1]:
   ...:             return (l[median] + l[median - 1]) / 2
   ...:         else:
   ...:             return l[median]
   ...:     return inner
   ...: df = pd.DataFrame({'a': range(20)})"""

In [3]: timeit.timeit("df['cummedian'] = df['a'].apply(cummedian())",setup=setup,number=100000)
Out[3]: 27.11604686321956

In [4]: timeit.timeit("df['expanding'] = df['a'].expanding().median()",setup=setup,number=100000)
Out[4]: 48.457676260100335

In [5]: 48.4576/27.116
Out[5]: 1.7870482372031273
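To make the closure easier to try outside of timeit, here is a standalone sketch of the same bisect-based idea, checked against expanding().median(). The variable names and the four-value sample frame are assumptions for illustration. The parity flag from the original is replaced by a length check, which should behave the same way.

```python
import bisect
import pandas as pd

def cummedian():
    """Return a stateful function that yields the running median of its inputs."""
    sorted_values = []  # every value seen so far, kept in sorted order

    def inner(value):
        bisect.insort(sorted_values, value)  # insert while preserving order
        count = len(sorted_values)
        mid = count // 2
        if count % 2 == 0:
            # Even count: average the two middle values.
            return (sorted_values[mid] + sorted_values[mid - 1]) / 2
        # Odd count: the middle value is the median.
        return sorted_values[mid]

    return inner

df = pd.DataFrame({"a": [5, 7, 6, 4]})
via_bisect = df["a"].apply(cummedian())
via_expanding = df["a"].expanding().median()
print(via_bisect.tolist(), via_expanding.tolist())
```

Each call to cummedian() creates fresh state, so a new closure is needed for every column. bisect.insort finds the position by binary search, but the list insertion itself is linear in the number of stored values, which may help explain why this approach does well on 20 rows yet falls behind on a million.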