Question
I have a DataFrame to which I would like to add new columns based on the value of a specific column, where the result depends on data contained in another DataFrame.
More specifically, I have
df_original =
Crncy Spread Duration
0 EUR 100 1.2
1 nan nan nan
2 100 3.46
3 CHF 200 2.5
4 USD 50 5.0
...
df_interpolation =
CRNCY TENOR Adj_EUR Adj_USD
0 EUR 1 10 20
1 EUR 2 20 30
2 EUR 5 30 40
3 EUR 7 40 50
...
10 CHF 1 50 10
11 CHF 2 60 20
12 CHF 5 70 30
...
and would now like to add the columns Adj_EUR and Adj_USD to df_original for each row, based on the values of Crncy and Duration, using standard linear interpolation.
So, we want to use TENOR and Adj_USD/Adj_EUR from df_interpolation and Duration from df_original, for each available Crncy, to form the interpolation.
E.g., pseudo-code using the optimize package from scipy:
from scipy import optimize
# Do this for both 'Adj_EUR' and 'Adj_USD'
# For 'Adj_EUR'
for curr, df in df_original.groupby('Crncy'):
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    x_data = df_interpolation.loc[df_interpolation['CRNCY'] == curr, 'TENOR'].to_numpy(dtype=float)
    y_data = df_interpolation.loc[df_interpolation['CRNCY'] == curr, 'Adj_EUR'].to_numpy(dtype=float)
    # Linear fit
    z_linear = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data)[0]
    # Somehow add the values back to df_original in a new column
    df['Adj_EUR'] = z_linear[0] + z_linear[1] * df['Duration']
Yielding
Crncy Spread Duration Adj_EUR Adj_USD
0 EUR 100 1.2 12 22
1 nan nan nan 0.0 0.0
...
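For reference, the expected values in row 0 are consistent with piecewise-linear interpolation between the two neighbouring tenors. The following is a worked check, assuming the EUR rows shown above; the symbols $t_0, t_1, y_0, y_1, d$ are introduced here for illustration and do not appear in the original post.

```latex
% Linear interpolation between tenors t_0 < d < t_1 with adjustments y_0, y_1
y(d) = y_0 + (y_1 - y_0)\,\frac{d - t_0}{t_1 - t_0}

% EUR, Duration d = 1.2, neighbouring tenors t_0 = 1, t_1 = 2
\text{Adj\_EUR} = 10 + (20 - 10)\,\frac{1.2 - 1}{2 - 1} = 12
\qquad
\text{Adj\_USD} = 20 + (30 - 20)\,\frac{1.2 - 1}{2 - 1} = 22
```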
Any clue on how to do this?
Much appreciated.
Answer 1:
Suppose we have df1 and df2:
>>> df1
Crncy Spread Duration
0 EUR 100 1.2
1 CHF 200 2.5
>>> df2
CRNCY TENOR Adj_EUR Adj_USD
0 EUR 1 10 20
1 EUR 2 20 30
2 EUR 5 30 40
3 EUR 7 40 50
4 CHF 1 50 10
5 CHF 2 60 20
6 CHF 5 70 30
Transform df1 and df2 into dataframes with the same columns:
df1['Adj_EUR'] = np.nan
df1['Adj_USD'] = np.nan
df1['left'] = 1
>>> df1
Crncy Spread Duration Adj_EUR Adj_USD left
0 EUR 100 1.2 NaN NaN 1
1 CHF 200 2.5 NaN NaN 1
df2 = df2.rename(columns={'CRNCY': 'Crncy', 'TENOR': 'Duration'})
df2['Spread'] = np.nan
df2['left'] = 0
>>> df2
Crncy Duration Adj_EUR Adj_USD Spread left
0 EUR 1 10 20 NaN 0
1 EUR 2 20 30 NaN 0
2 EUR 5 30 40 NaN 0
3 EUR 7 40 50 NaN 0
4 CHF 1 50 10 NaN 0
5 CHF 2 60 20 NaN 0
6 CHF 5 70 30 NaN 0
Now concatenate df1 and df2 row-wise and sort by Crncy and Duration:
df3 = pd.concat([df1, df2], ignore_index=True, sort=False).sort_values(['Crncy', 'Duration'])
>>> df3
Crncy Spread Duration Adj_EUR Adj_USD left
6 CHF NaN 1.0 50.0 10.0 0
7 CHF NaN 2.0 60.0 20.0 0
1 CHF 200.0 2.5 NaN NaN 1
8 CHF NaN 5.0 70.0 30.0 0
2 EUR NaN 1.0 10.0 20.0 0
0 EUR 100.0 1.2 NaN NaN 1
3 EUR NaN 2.0 20.0 30.0 0
4 EUR NaN 5.0 30.0 40.0 0
5 EUR NaN 7.0 40.0 50.0 0
Then interpolate the NaN values of each column within each currency, using Duration as the index, and drop the unnecessary columns:
df3 = df3.set_index('Duration')
df4 = df3.groupby(['Crncy']).apply(lambda x: x.interpolate(method='index')).reset_index()
df4 = df4[['Crncy', 'Spread', 'Duration', 'Adj_EUR', 'Adj_USD', 'left']]
df4 = df4.loc[df4['left'] == 1].drop('left', axis=1).reset_index(drop=True)
>>> df4
Crncy Spread Duration Adj_EUR Adj_USD
0 CHF 200.0 2.5 61.666667 21.666667
1 EUR 100.0 1.2 12.000000 22.000000
Hope this helps.
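As an alternative to the concat-and-interpolate approach above, the same piecewise-linear values can be computed per currency with np.interp. This is a minimal sketch, not the answerer's method: the helper name add_interpolated and its cols parameter are hypothetical, and the sample frames simply repeat df1 and df2 from this answer. It avoids groupby().apply(), whose handling of group keys has changed across pandas versions.

```python
import numpy as np
import pandas as pd

# Sample data copied from df1 and df2 above
df1 = pd.DataFrame({
    "Crncy": ["EUR", "CHF"],
    "Spread": [100, 200],
    "Duration": [1.2, 2.5],
})
df2 = pd.DataFrame({
    "CRNCY": ["EUR"] * 4 + ["CHF"] * 3,
    "TENOR": [1, 2, 5, 7, 1, 2, 5],
    "Adj_EUR": [10, 20, 30, 40, 50, 60, 70],
    "Adj_USD": [20, 30, 40, 50, 10, 20, 30],
})


def add_interpolated(df, curves, cols=("Adj_EUR", "Adj_USD")):
    """Hypothetical helper: interpolate each column in `cols` at df['Duration']."""
    out = df.copy()
    for col in cols:
        out[col] = np.nan
    # groupby skips rows whose Crncy is NaN, so those rows keep NaN
    for curr, grp in df.groupby("Crncy"):
        curve = curves[curves["CRNCY"] == curr].sort_values("TENOR")
        if curve.empty:
            continue  # no curve for this currency: leave NaN
        x = curve["TENOR"].to_numpy(dtype=float)
        d = grp["Duration"].to_numpy(dtype=float)
        for col in cols:
            y = curve[col].to_numpy(dtype=float)
            # np.interp clamps to the end values outside [x.min(), x.max()]
            out.loc[grp.index, col] = np.interp(d, x, y)
    return out


result = add_interpolated(df1, df2)
print(result)
```

Note that np.interp does not extrapolate: a Duration below the smallest or above the largest TENOR receives the nearest end value. The original row order and index of df1 are preserved.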
Answer 2:
So, this is more what I was looking for:
import pandas as pd
from scipy import optimize

temp_interpolation_lst = []
for curr, df in df_original.groupby('Crncy'):
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    curve = df_interpolation[df_interpolation['CRNCY'] == curr]
    x_data = curve['TENOR'].to_numpy(dtype=float)
    y_data_usd = curve['Adj_USD'].to_numpy(dtype=float)
    y_data_eur = curve['Adj_EUR'].to_numpy(dtype=float)
    # Linear fit; skip currencies that have no curve data
    if x_data.size > 0:
        z_linear_usd = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data_usd)[0]
        z_linear_eur = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data_eur)[0]
        temp_df = df[['Crncy', 'Duration']].copy()
        # The original post read temp_df['OAD'] here; only 'Duration' exists in temp_df
        temp_df['Adj_USD'] = z_linear_usd[0] + z_linear_usd[1] * temp_df['Duration']
        temp_df['Adj_EUR'] = z_linear_eur[0] + z_linear_eur[1] * temp_df['Duration']
        temp_interpolation_lst.append(temp_df)
temp_interpolation_df = pd.concat(temp_interpolation_lst).sort_index()
# Add back to the original DataFrame; the indices are the same and match
df_original = df_original.join(temp_interpolation_df[['Adj_USD', 'Adj_EUR']], how='left')
It is not as clean as I had hoped, but it still seems to work...
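One point worth checking with this approach: a straight-line least-squares fit over all tenors is not the same as interpolating between neighbouring tenors, so it need not reproduce the 12 expected in the question. The sketch below compares the two for the EUR Adj_EUR curve at Duration 1.2. It uses np.polyfit in place of optimize.curve_fit as an assumed equivalent for fitting a + b * t; the variable names are illustrative only.

```python
import numpy as np

# EUR curve from df_interpolation: TENOR vs Adj_EUR
tenor = np.array([1.0, 2.0, 5.0, 7.0])
adj_eur = np.array([10.0, 20.0, 30.0, 40.0])
duration = 1.2

# Least-squares straight line through all four points
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(tenor, adj_eur, deg=1)
fitted = intercept + slope * duration

# Piecewise-linear interpolation between the neighbouring tenors 1 and 2
interpolated = np.interp(duration, tenor, adj_eur)

print(f"least-squares fit: {fitted:.4f}")        # about 13.2308
print(f"linear interpolation: {interpolated:.4f}")  # 12.0000
```

If the goal is the value shown in the question's expected output, np.interp (or the index-based interpolate in Answer 1) matches it; the fitted line gives a smoothed value instead.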
Source: https://stackoverflow.com/questions/51838355/applying-interpolation-on-dataframe-based-on-another-dataframe