Question
Suppose I have a DataFrame (or Series) like this:
   Value
0    0.5
1    0.8
2   -0.2
3   None
4   None
5   None
I wish to create a new Result column.
The value of each result is determined by the previous Value, via an arbitrary function f.
If the previous Value is not available (None or NaN), I wish to use instead the previous Result (and apply f to it, of course).
Using the previous Value is easy, I just need to use shift. However, accessing the previous result doesn't seem to be that simple.
For example, the following code calculates the result, but cannot access the previous result if needed.
df['Result'] = df['Value'].shift(1).apply(f)
Please assume that f is arbitrary, and thus solutions using things like cumsum are not possible.
Obviously, this can be done by iteration, but I want to know if a more Panda-y solution exists.
import pandas as pd

df['Result'] = None
for i in range(1, len(df)):
    # .loc works here because df has the default 0..n-1 integer index
    value = df.loc[i - 1, 'Value']
    if pd.isna(value):  # catches both None and NaN
        value = df.loc[i - 1, 'Result']
    df.loc[i, 'Result'] = f(value)
Example output, given f = lambda x: x+1:
Bad:
   Value  Result
0    0.5     NaN
1    0.8     1.5
2   -0.2     1.8
3    NaN     0.8
4    NaN     NaN
5    NaN     NaN
Good:
   Value  Result
0    0.5     NaN
1    0.8     1.5
2   -0.2     1.8
3    NaN     0.8
4    NaN     1.8  <-- previous Value not available, used f(previous Result)
5    NaN     2.8  <-- same
Answer 1:
Looks like it has to be a loop to me. And I abhor loops... so when I loop, I use numba.
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
https://numba.pydata.org/
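import numpy as np
from numba import njit

@njit
def f(x):
    return x + 1

@njit
def g(a):
    # Row 0 has no previous row, so its result is NaN
    r = [np.nan]
    for v in a[:-1]:
        if np.isnan(v):
            # Previous Value is missing: apply f to the previous result
            r.append(f(r[-1]))
        else:
            r.append(f(v))
    return r

df.assign(Result=g(df.Value.values))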
   Value  Result
0    0.5     NaN
1    0.8     1.5
2   -0.2     1.8
3    NaN     0.8
4    NaN     1.8
5    NaN     2.8
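Note that @njit requires f itself to be compilable by numba, which a truly arbitrary Python function may not be. As a minimal sketch, assuming speed is not critical, the same loop can be written in plain Python. The name carry_forward_apply is hypothetical, and the extra NaN guard is an assumption added so that f is never called on a missing input.

```python
import numpy as np
import pandas as pd


def carry_forward_apply(values, f):
    """Result[i] = f(Value[i-1]), or f(Result[i-1]) when Value[i-1] is missing."""
    results = [np.nan]  # row 0 has no previous row
    for v in values[:-1]:
        # Prefer the previous Value; fall back to the previous Result
        source = results[-1] if pd.isna(v) else v
        # If neither is available, keep NaN instead of calling f
        results.append(np.nan if pd.isna(source) else f(source))
    return results


df = pd.DataFrame({'Value': [0.5, 0.8, -0.2, None, None, None]})
df = df.assign(Result=carry_forward_apply(df['Value'].to_numpy(), lambda x: x + 1))
print(df)
```

This should reproduce the Result column shown above, at plain-Python loop speed.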
Answer 2:
I think this might work, but I'm not sure. It stores the previously calculated value in a closure.
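import math

def use_previous_if_none(f):
    prev = None  # most recent result; None until one has been computed

    def wrapped(val):
        nonlocal prev
        # Test for None first: math.isnan(None) raises TypeError
        if val is None or math.isnan(val):
            if prev is None:
                return math.nan  # first row: nothing to fall back on
            val = prev
        res = f(val)
        prev = res
        return res

    return wrapped

df['Result'] = df.Value.shift(1).apply(use_previous_if_none(f))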
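The same "carry the previous result along" idea can also be written without a closure, using the standard library's itertools.accumulate, which passes the running result to each step. This is a swapped-in formulation, not the answer's own method. It is only a sketch on plain lists: step is a hypothetical helper name, and f = x + 1 stands in for the arbitrary function.

```python
import math
from itertools import accumulate


def f(x):
    # Stand-in for the arbitrary function; x + 1 matches the question's example
    return x + 1


def step(prev_result, prev_value):
    # Use the previous Value when present, otherwise the previous Result
    source = prev_result if math.isnan(prev_value) else prev_value
    # Keep NaN when there is nothing to apply f to
    return math.nan if math.isnan(source) else f(source)


values = [0.5, 0.8, -0.2, math.nan, math.nan, math.nan]
shifted = [math.nan] + values[:-1]  # plain-list equivalent of df['Value'].shift(1)

# accumulate emits shifted[0] (NaN) unchanged, then step(acc, x) for each later x
results = list(accumulate(shifted, step))
print(results)
```

With a DataFrame, the resulting list could be assigned directly as the Result column, since it has one entry per row.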
Answer 3:
I suggest a solution without explicit loops over the rows. Instead of referencing the previous value, it ffill()s the NaNs and then applies f as many times as required, only to the values that were at the indices of NaNs.
We start by defining a helper function that calls f n times:
import pandas as pd

def apply_f_n_times(arg):
    # arg is one row holding (Result, Apply_times)
    x = arg.iloc[0]
    n = int(arg.iloc[1])
    for i in range(n):
        x = f(x)
    return x

df = pd.DataFrame({'Value': [1, 2, 3, 5, None, None, 12, 9, None, 6, 1, None, None, None]})
df['Result'] = df['Value'].shift(1).apply(f)
# the following 2 lines create a counter of consecutive NaNs
tmp = df['Result'].isnull()
df['Apply_times'] = tmp * (tmp.groupby((tmp != tmp.shift()).cumsum()).cumcount() + 1)
# fill NaNs with the previous good value
df['Result'] = df['Result'].ffill()
# apply f N times
df['Result'] = df[['Result', 'Apply_times']].apply(apply_f_n_times, axis=1)
The result:
Out[2]:
    Value  Result  Apply_times
0     1.0     nan            1
1     2.0     2.0            0
2     3.0     3.0            0
3     5.0     4.0            0
4     nan     6.0            0
5     nan     7.0            1
6    12.0     8.0            2
7     9.0    13.0            0
8     nan    10.0            0
9     6.0    11.0            1
10    1.0     7.0            0
11    nan     2.0            0
12    nan     3.0            1
13    nan     4.0            2
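The consecutive-NaN counter is the least obvious step, so here it is in isolation on a small boolean Series. This is only an illustrative sketch: the names is_missing, run_id and nan_streak are made up for the example and do not appear in the answer.

```python
import pandas as pd

is_missing = pd.Series([True, False, False, True, True, False, True, True, True])

# A new run starts wherever the flag differs from the previous row,
# so the cumulative sum gives every run of equal flags its own id
run_id = (is_missing != is_missing.shift()).cumsum()

# 1-based position inside each run, zeroed out for the non-missing rows
nan_streak = is_missing * (is_missing.groupby(run_id).cumcount() + 1)

print(nan_streak.tolist())  # [1, 0, 0, 1, 2, 0, 1, 2, 3]
```

Each non-zero entry is the number of times f has to be applied on top of the forward-filled value.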
Answer 4:
This might fit the pandas coding style, although its efficiency would need further testing. It is not applicable to general functions: it is a trick that only works for the plus-one function.
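import pandas as pd
import numpy as np

df = pd.DataFrame({'Value': [0.5, 0.8, -0.2, None, None, None]})
# size of the span covered by the missing values (3 for this data)
index = df['Value'].index[df['Value'].apply(np.isnan)]
window = int(max(index) - min(index) + 1)
df['next'] = df['Value'].shift(1)

def getX(x):
    # position of the last non-NaN entry in the window
    last = np.where(~np.isnan(x))[0][-1]
    # adding (len(x) - last) equals applying f(x) = x + 1 that many times
    return x[last] + len(x) - last

# raw=True passes each window as a NumPy array, so x[last] indexes by position
df['plus_one'] = df['next'].rolling(window=window, min_periods=1).apply(getX, raw=True)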
Source: https://stackoverflow.com/questions/46421928/pandas-apply-but-access-previously-calculated-value