Pandas apply, but access previously calculated value

Question

Suppose I have a DataFrame (or Series) like this:

     Value
0    0.5
1    0.8
2    -0.2
3    None
4    None
5    None

I wish to create a new Result column.

The value of each result is determined by the previous Value, via an arbitrary function f.

If the previous Value is not available (None or NaN), I wish to use instead the previous Result (and apply f to it, of course).

Using the previous Value is easy, I just need to use shift. However, accessing the previous result doesn't seem to be that simple.

For example, the following code calculates the result, but cannot access the previous result if needed.

df['Result'] = df['Value'].shift(1).apply(f)

Please assume that f is arbitrary, and thus solutions using things like cumsum are not possible.

Obviously, this can be done by iteration, but I want to know if a more Panda-y solution exists.

import numpy as np
import pandas as pd

df['Result'] = np.nan
for i in range(1, len(df)):
  value = df['Value'].iloc[i-1]
  if pd.isna(value):  # covers both None and NaN
    value = df['Result'].iloc[i-1]
  df.loc[df.index[i], 'Result'] = f(value)

Example output, given f = lambda x: x+1:

Bad:

   Value    Result
0    0.5       NaN
1    0.8       1.5
2   -0.2       1.8
3    NaN       0.8
4    NaN       NaN
5    NaN       NaN

Good:

   Value    Result
0    0.5       NaN
1    0.8       1.5
2   -0.2       1.8
3    NaN       0.8
4    NaN       1.8   <-- previous Value not available, used f(previous result)
5    NaN       2.8   <-- same

Answer 1:

Looks like it has to be a loop to me. And I abhor loops... so when I loop, I use numba

Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

https://numba.pydata.org/

import numpy as np
from numba import njit


@njit
def f(x):
    return x + 1

@njit
def g(a):
    r = [np.nan]
    for v in a[:-1]:
        if np.isnan(v):
            r.append(f(r[-1]))
        else:
            r.append(f(v))
    return r

df.assign(Result=g(df.Value.values))

   Value  Result
0    0.5     NaN
1    0.8     1.5
2   -0.2     1.8
3    NaN     0.8
4    NaN     1.8
5    NaN     2.8
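
If f is an arbitrary Python callable that numba cannot compile, the same loop can be written in plain Python. This is only a minimal sketch under that assumption: the helper name shifted_apply is hypothetical, and the data and f are the samples from the question.

```python
import numpy as np
import pandas as pd

def shifted_apply(values, f):
    """Apply f to the previous value, or to the previous result when that value is missing."""
    out = [np.nan]  # row 0 has no predecessor
    for v in values[:-1]:
        # pd.isna handles both None and NaN
        out.append(f(out[-1]) if pd.isna(v) else f(v))
    return out

df = pd.DataFrame({'Value': [0.5, 0.8, -0.2, None, None, None]})
df['Result'] = shifted_apply(df['Value'].to_numpy(), lambda x: x + 1)
print(df)
```

This trades numba's speed for generality; the loop body is otherwise the same as g above.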

Answer 2:

I think this might work, but I'm not sure. It stores the previously calculated value in a closure.

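import numpy as np
import pandas as pd

def use_previous_if_none(f):
    prev = None  # last computed result, kept in the closure
    def wrapped(val):
        nonlocal prev
        if pd.isna(val):  # covers both None and NaN
            if prev is None:  # first row: nothing to fall back on yet
                return np.nan
            val = prev
        res = f(val)
        prev = res
        return res
    return wrapped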

df['Result'] = df.Value.shift(1).apply(use_previous_if_none(f))
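
The closure above relies on hidden state that persists between calls. A different way to carry the previous result forward, shown here only as a sketch, is itertools.accumulate, which passes the running result to each step explicitly. The step function is a hypothetical helper, f is the example from the question, and the initial keyword assumes Python 3.8 or later.

```python
from itertools import accumulate

import numpy as np
import pandas as pd

def f(x):  # example f from the question
    return x + 1

def step(prev_result, prev_value):
    # Use the previous Value when present, otherwise fall back to the previous Result.
    return f(prev_result) if pd.isna(prev_value) else f(prev_value)

df = pd.DataFrame({'Value': [0.5, 0.8, -0.2, None, None, None]})
# initial=np.nan becomes row 0's Result; every later row sees the Value before it.
df['Result'] = list(accumulate(df['Value'].iloc[:-1], step, initial=np.nan))
print(df)
```

This is still a Python-level loop under the hood, so it is no faster than iterating by hand; it only makes the state explicit.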

Answer 3:

I suggest a solution without an explicit loop over the rows. Instead of referencing the previous value, it ffill()s the NaNs and then applies f as many times as required, only to the values at the indices where the NaNs were.

We start by defining a helper function that calls f n times:

def apply_f_n_times(row):
    # row is one DataFrame row holding 'Result' and 'Apply_times'
    x = row['Result']
    n = int(row['Apply_times'])
    for _ in range(n):
        x = f(x)
    return x

import pandas as pd
df = pd.DataFrame({'Value': [1, 2, 3, 5, None, None, 12, 9, None, 6, 1, None, None, None]})
df['Result'] = df['Value'].shift(1).apply(f)
# the following 2 lines will create counter of consecutive NaNs 
tmp = df['Result'].isnull()
df['Apply_times'] = tmp * (tmp.groupby((tmp != tmp.shift()).cumsum()).cumcount() + 1)
# fill NaNs with previous good value 
df['Result'] = df['Result'].ffill()
# apply N times
df['Result'] = df[['Result', 'Apply_times']].apply(apply_f_n_times, axis=1)

The result:

Out[2]:
      Value  Result  Apply_times
0     1.0     nan            1
1     2.0     2.0            0
2     3.0     3.0            0
3     5.0     4.0            0
4     nan     6.0            0
5     nan     7.0            1
6    12.0     8.0            2
7     9.0    13.0            0
8     nan    10.0            0
9     6.0    11.0            1
10    1.0     7.0            0
11    nan     2.0            0
12    nan     3.0            1
13    nan     4.0            2
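
The consecutive-NaN counter is the least obvious step, so here it is in isolation on a small made-up Series. The variable names are illustrative and not from the answer.

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, None])
is_nan = s.isnull()
# A new run starts wherever is_nan differs from the row before it.
run_id = (is_nan != is_nan.shift()).cumsum()
# 1-based position inside each run, zeroed out for rows that are not NaN.
nan_streak = is_nan * (is_nan.groupby(run_id).cumcount() + 1)
print(nan_streak.tolist())  # [0, 1, 2, 0, 1]
```

Each NaN row ends up with the number of times f has to be applied to the forward-filled value.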

Answer 4:

This might fit the pandas coding style better, although its efficiency would need further testing. It is not applicable to general functions; it only works here because it exploits the structure of the plus-one function.

import pandas as pd
import numpy as np


df = pd.DataFrame({'Value':[0.5,0.8,-0.2,None,None,None]})
index = df['Value'].index[df['Value'].apply(np.isnan)]
window = max(index)-min(index)+1
df['next'] =df['Value'].shift(1)


def getX(x):
    last = np.where(~np.isnan(x))[0][-1]
    return (x[last])+len(x)-last



# raw=True passes NumPy arrays to getX, which indexes by position
df['plus_one'] = df['next'].rolling(window=window, min_periods=1).apply(getX, raw=True)
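
Why this only works for plus one: in getX, len(x) - last is the number of times f would have to be applied starting from the last valid value in the window, and for f(x) = x + 1, applying f n times collapses to x + n. A small sketch of that identity follows; apply_n_times is a hypothetical helper, not part of the answer.

```python
from functools import reduce

def f(x):  # the plus-one function from the question
    return x + 1

def apply_n_times(func, x, n):
    # Feed the result back into func n times, starting from x.
    return reduce(lambda acc, _: func(acc), range(n), x)

x, n = -0.2, 3
print(apply_n_times(f, x, n), x + n)  # the two agree up to floating-point rounding
```

For a general f there is no such closed form, so the repeated application has to be carried out explicitly, as in the other answers.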

Source: https://stackoverflow.com/questions/46421928/pandas-apply-but-access-previously-calculated-value

Tags

python

pandas

apply

shift