Question
Suppose I have a DataFrame (or Series) like this:
Value
0 0.5
1 0.8
2 -0.2
3 None
4 None
5 None
I wish to create a new Result column.
The value of each result is determined by the previous Value, via an arbitrary function f.
If the previous Value is not available (None or NaN), I wish to use the previous Result instead (and apply f to it, of course).
Using the previous Value is easy: I just need to use shift. However, accessing the previous result doesn't seem to be that simple.
For example, the following code calculates the result, but cannot access the previous result if needed.
df['Result'] = df['Value'].shift(1).apply(f)
Please assume that f is arbitrary, and thus solutions using things like cumsum are not possible.
Obviously, this can be done by iteration, but I want to know if a more Panda-y solution exists.
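import math

df['Result'] = None
for i in range(1, len(df)):
    # .loc is label-based; this assumes the default 0..n-1 integer index
    value = df.loc[i - 1, 'Value']
    # test for None first: math.isnan(None) raises TypeError
    if value is None or math.isnan(value):
        value = df.loc[i - 1, 'Result']
    df.loc[i, 'Result'] = f(value)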
Example output, given f = lambda x: x+1:
Bad:
Value Result
0 0.5 NaN
1 0.8 1.5
2 -0.2 1.8
3 NaN 0.8
4 NaN NaN
5 NaN NaN
Good:
Value Result
0 0.5 NaN
1 0.8 1.5
2 -0.2 1.8
3 NaN 0.8
4 NaN 1.8 <-- previous Value not available, used f(previous result)
5 NaN 2.8 <-- same
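To make the rule concrete, here is a minimal reference sketch on plain Python lists rather than a DataFrame. The names is_missing and expected_result are hypothetical, and returning NaN when neither the previous Value nor the previous Result exists is an assumption, since the question does not cover that case. It can be used to check the output of any other approach.

```python
import math

def is_missing(x):
    """True for None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))

def expected_result(values, f):
    """Reference loop: Result[i] = f(Value[i-1]), falling back to f(Result[i-1])."""
    results = []
    for i in range(len(values)):
        if i == 0:
            results.append(float("nan"))  # first row has no predecessor
            continue
        prev = values[i - 1]
        if is_missing(prev):
            prev = results[i - 1]  # fall back to the previous result
        # Assumption: if that is missing too, nothing can be computed yet.
        results.append(float("nan") if is_missing(prev) else f(prev))
    return results

values = [0.5, 0.8, -0.2, None, None, None]
result = expected_result(values, lambda x: x + 1)
```

For the sample data this should agree with the "Good" table above, up to floating-point rounding.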
Answer 1:
Looks like it has to be a loop to me. And I abhor loops... so when I loop, I use numba.
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
https://numba.pydata.org/
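import numpy as np
from numba import njit

@njit
def f(x):
    return x + 1

@njit
def g(a):
    # r[i] is the result for row i; the first row has no predecessor
    r = [np.nan]
    for v in a[:-1]:
        if np.isnan(v):
            # previous Value missing: apply f to the previous result
            r.append(f(r[-1]))
        else:
            r.append(f(v))
    return r

df.assign(Result=g(df.Value.values))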
Value Result
0 0.5 NaN
1 0.8 1.5
2 -0.2 1.8
3 NaN 0.8
4 NaN 1.8
5 NaN 2.8
Answer 2:
I think this might work, but I'm not sure. It stores the previously calculated value in a closure.
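import math

def use_previous_if_none(f):
    prev = None  # most recent result; None until f has been called once
    def wrapped(val):
        nonlocal prev
        # test for None first: math.isnan(None) raises TypeError
        if val is None or math.isnan(val):
            if prev is None:
                # nothing to fall back on yet (e.g. the first, shifted row)
                return float('nan')
            val = prev
        res = f(val)
        prev = res
        return res
    return wrapped

df['Result'] = df.Value.shift(1).apply(use_previous_if_none(f))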
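The same idea of carrying the previous result along can also be sketched with itertools.accumulate from the standard library, which hands the previous output to each step without a closure. This is a swapped-in technique, not the answerer's method. The function name results_via_accumulate is hypothetical, and the initial keyword requires Python 3.8 or later. Note one assumption: if the very first Value is missing, f is called on NaN.

```python
import math
from itertools import accumulate

def results_via_accumulate(values, f):
    """Carry the previous result through the sequence with itertools.accumulate."""
    def step(prev_result, prev_value):
        # Use the previous Value when present, otherwise the previous result.
        if prev_value is None or math.isnan(prev_value):
            return f(prev_result)
        return f(prev_value)

    # values[:-1] plays the role of shift(1); the initial NaN is the first row.
    out = accumulate(values[:-1], step, initial=float("nan"))
    return list(out)[:len(values)]  # the slice handles an empty input

values = [0.5, 0.8, -0.2, float("nan"), float("nan"), float("nan")]
result = results_via_accumulate(values, lambda x: x + 1)
```

With a DataFrame, something like df['Result'] = results_via_accumulate(df['Value'].tolist(), f) should work, since the returned list has one entry per row.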
Answer 3:
I suggest a solution without explicit loops over the rows. Instead of referencing the previously calculated value, it ffill()s the NaNs and then applies f as many times as required, only to the values that were at the indices of the NaNs.
We start by defining a helper function that calls f n times:
def apply_f_n_times(arg):
    # arg is one row holding the filled value and the number of extra calls
    x = arg['Result']
    n = int(arg['Apply_times'])
    for i in range(n):
        x = f(x)
    return x

df = pd.DataFrame({'Value': [1, 2, 3, 5, None, None, 12, 9, None, 6, 1, None, None, None]})
df['Result'] = df['Value'].shift(1).apply(f)
# the following 2 lines will create a counter of consecutive NaNs
tmp = df['Result'].isnull()
df['Apply_times'] = tmp * (tmp.groupby((tmp != tmp.shift()).cumsum()).cumcount() + 1)
# fill NaNs with the previous good value
df['Result'] = df['Result'].ffill()
# apply f N times
df['Result'] = df[['Result', 'Apply_times']].apply(apply_f_n_times, axis=1)
The result:
Out[2]:
Value Result Apply_times
0 1.0 nan 1
1 2.0 2.0 0
2 3.0 3.0 0
3 5.0 4.0 0
4 nan 6.0 0
5 nan 7.0 1
6 12.0 8.0 2
7 9.0 13.0 0
8 nan 10.0 0
9 6.0 11.0 1
10 1.0 7.0 0
11 nan 2.0 0
12 nan 3.0 1
13 nan 4.0 2
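The consecutive-NaN counter is probably the least obvious step here, so here is a small sketch that isolates it on a boolean mask. The function name consecutive_true_counter and the sample mask are made up for illustration; the shift/cumsum/cumcount combination is the one used in the code above.

```python
import pandas as pd

def consecutive_true_counter(mask: pd.Series) -> pd.Series:
    """Number each True by its position within its run; False rows get 0."""
    # A new run starts wherever the value differs from the previous row.
    run_id = (mask != mask.shift()).cumsum()
    # Position within each run, starting at 1.
    position = mask.groupby(run_id).cumcount() + 1
    # Keep the position only where the mask is True.
    return position.where(mask, 0)

mask = pd.Series([True, False, False, True, True, False, True])
counter = consecutive_true_counter(mask)
```

In the answer, the mask is df['Result'].isnull(), so the counter says how many extra times f has to be applied to the forward-filled value in each row.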
Answer 4:
This might fit the pandas coding style, but its efficiency needs further testing. It is not applicable to general functions: it exploits the fact that applying the plus-one function n times is the same as adding n.
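import pandas as pd
import numpy as np

df = pd.DataFrame({'Value': [0.5, 0.8, -0.2, None, None, None]})
index = df['Value'].index[df['Value'].apply(np.isnan)]
# length of the NaN block (assumes the NaNs form one contiguous run)
window = max(index) - min(index) + 1
df['next'] = df['Value'].shift(1)

def getX(x):
    # position of the last non-NaN entry in the window
    last = np.where(~np.isnan(x))[0][-1]
    # adding the distance to the window end equals applying x + 1 that many times
    return x[last] + len(x) - last

# raw=True passes each window as a NumPy array, so x[last] is positional
df['plus_one'] = df['next'].rolling(window=window, min_periods=1).apply(getX, raw=True)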
Source: https://stackoverflow.com/questions/46421928/pandas-apply-but-access-previously-calculated-value