问题
I have a situation where I have a dataframe row to perform calculations with, and I need to use values in following (potentially preceding) rows to do these calculations (essentially a perfect forecast based on the real data set). I get each row from an earlier df.apply
call, so I could pass the whole df along to the downstream objects, but that seems less than ideal based on the complexity of objects in my analysis.
I found one closely related question and answer [1], but the problem is actually fundamentally different in the sense that I do not need the whole df for my calcs, simply the following x
number of rows (which might matter for large dfs).
So, for example:
df = pd.DataFrame([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
columns=['PRICE'])
horizon = 3
I need to access values in the following 3 (horizon
) rows in my row-wise df.apply
call. How can I get a naive forecast of the next 3 data points dynamically in my row-wise apply calcs? e.g. for row the first row, where the PRICE
is 100
, I need to use [200, 300, 400]
as a forecast in my calcs.
[1] apply a function to a pandas Dataframe whose returned value is based on other rows
回答1:
By getting the row's index inside of the df.apply
call using row.name
[1], you can generate the 'forecast' data relative to which row you are currently on. This is effectively a preprocessing step to put the 'forecast' onto the relevant row, or it could be done as part of the initial df.apply
call if the df is available downstream.
df = pd.DataFrame([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000], columns=['PRICE'])
horizon = 3
df['FORECAST'] = df.apply(lambda x: [df['PRICE'][x.name+1:x.name+horizon+1]], axis=1)
Results in this:
PRICE FORECAST
0 100 [200, 300, 400]
1 200 [300, 400, 500]
2 300 [400, 500, 600]
3 400 [500, 600, 700]
4 500 [600, 700, 800]
5 600 [700, 800, 900]
6 700 [800, 900, 1000]
7 800 [900, 1000]
8 900 [1000]
9 1000 []
Which can be used in your row-wise df.apply
calcs.
EDIT: If you want to strip the index from the resulting 'Forecast':
df['FORECAST'] = df.apply(lambda x: [df['PRICE'][x.name+1:x.name+horizon+1].reset_index(drop=True)], axis=1)
[1] getting the index of a row in a pandas apply function
回答2:
You may find this useful as well.
keys = range(horizon + 1)
pd.concat([df.shift(-i) for i in keys], axis=1, keys=keys)
0 1 2 3
PRICE PRICE PRICE PRICE
0 100 200.0 300.0 400.0
1 200 300.0 400.0 500.0
2 300 400.0 500.0 600.0
3 400 500.0 600.0 700.0
4 500 600.0 700.0 800.0
5 600 700.0 800.0 900.0
6 700 800.0 900.0 1000.0
7 800 900.0 1000.0 NaN
8 900 1000.0 NaN NaN
9 1000 NaN NaN NaN
if you assign the concat
to df_c
keys = range(horizon + 1)
df_c = pd.concat([df.shift(-i) for i in keys], axis=1, keys=keys)
df_c.apply(lambda x: pd.Series([x[0].values, x[1:].values]), axis=1)
0 1
0 [100.0] [200.0, 300.0, 400.0]
1 [200.0] [300.0, 400.0, 500.0]
2 [300.0] [400.0, 500.0, 600.0]
3 [400.0] [500.0, 600.0, 700.0]
4 [500.0] [600.0, 700.0, 800.0]
5 [600.0] [700.0, 800.0, 900.0]
6 [700.0] [800.0, 900.0, 1000.0]
7 [800.0] [900.0, 1000.0, nan]
8 [900.0] [1000.0, nan, nan]
9 [1000.0] [nan, nan, nan]
来源:https://stackoverflow.com/questions/37149358/apply-function-to-pandas-dataframe-row-using-values-in-other-rows