Applying interpolation on DataFrame based on another DataFrame

巧了我就是萌 submitted on 2021-02-08 10:29:10

Question


I have a DataFrame to which I would like to add new columns. Their values depend on a specific column of that DataFrame and on data contained in another DataFrame.

More specifically, I have

df_original = 

    Crncy  Spread  Duration
0   EUR    100     1.2
1   nan    nan     nan
2          100     3.46
3   CHF    200     2.5
4   USD    50      5.0
...

df_interpolation = 

    CRNCY  TENOR   Adj_EUR   Adj_USD
0   EUR    1       10        20    
1   EUR    2       20        30  
2   EUR    5       30        40  
3   EUR    7       40        50  
...
10  CHF    1       50        10  
11  CHF    2       60        20  
12  CHF    5       70        30  
...

and would now like to add the columns Adj_EUR and Adj_USD to df_original for each row, based on the value of Crncy and Duration using standard linear interpolation.

So, we want to use TENOR and Adj_USD/Adj_EUR from df_interpolation and Duration from df_original, for each available Crncy, to form the interpolation.

For example, in pseudo-code using the optimize package from scipy:

from scipy import optimize

# Do this for both 'Adj_EUR' and 'Adj_USD'

# For 'Adj_EUR'
for curr, df in df_original.groupby('Crncy'):

    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    curve = df_interpolation[df_interpolation['CRNCY'] == curr]
    x_data = curve['TENOR'].to_numpy(dtype=float)
    y_data = curve['Adj_EUR'].to_numpy(dtype=float)

    # Linear fit
    z_linear = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data)[0]
    # Somehow add the values back to df_original in a new column
    df['Adj_EUR'] = z_linear[0] + z_linear[1] * df['Duration']

Yielding

    Crncy  Spread  Duration  Adj_EUR  Adj_USD
0   EUR    100     1.2       12       22
1   nan    nan     nan       0.0      0.0
...
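
The 12 and 22 in the first row can be checked by hand. The snippet below is only a sketch of that arithmetic: it assumes the bracketing points for Duration 1.2 are the EUR rows with TENOR 1 and 2 from df_interpolation, and lerp is a hypothetical helper name, not something from pandas or scipy.

```python
# Worked check of the first row (EUR, Duration 1.2).
# lerp is a hypothetical helper introduced for illustration only.
def lerp(x, x0, x1, y0, y1):
    """Linear interpolation between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

adj_eur = lerp(1.2, 1, 2, 10, 20)  # Adj_EUR at TENOR 1 and TENOR 2
adj_usd = lerp(1.2, 1, 2, 20, 30)  # Adj_USD at TENOR 1 and TENOR 2
print(adj_eur, adj_usd)  # approximately 12 and 22, up to float rounding
```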

Any clue on how to do this?

Much appreciated.


Answer 1:


Suppose we have df1 and df2

>>> df1
  Crncy  Spread  Duration
0   EUR     100       1.2
1   CHF     200       2.5


>>> df2
  CRNCY  TENOR  Adj_EUR  Adj_USD
0   EUR      1       10       20
1   EUR      2       20       30
2   EUR      5       30       40
3   EUR      7       40       50
4   CHF      1       50       10
5   CHF      2       60       20
6   CHF      5       70       30

Transform df1 and df2 so that they share the same columns:

import numpy as np
import pandas as pd

df1['Adj_EUR'] = np.nan
df1['Adj_USD'] = np.nan
df1['left'] = 1  # marks rows that come from df1

>>> df1
  Crncy  Spread  Duration  Adj_EUR  Adj_USD  left
0   EUR     100       1.2      NaN      NaN     1
1   CHF     200       2.5      NaN      NaN     1

df2 = df2.rename(columns={'CRNCY': 'Crncy', 'TENOR': 'Duration'})
df2['Spread'] = np.nan
df2['left'] = 0

>>> df2
  Crncy  Duration  Adj_EUR  Adj_USD  Spread  left
0   EUR         1       10       20     NaN     0
1   EUR         2       20       30     NaN     0
2   EUR         5       30       40     NaN     0
3   EUR         7       40       50     NaN     0
4   CHF         1       50       10     NaN     0
5   CHF         2       60       20     NaN     0
6   CHF         5       70       30     NaN     0

Now concatenate df1 and df2 row-wise and sort the result by Crncy and Duration.

df3 = pd.concat([df1, df2], ignore_index=True, sort=False).sort_values(['Crncy', 'Duration'])

>>> df3
  Crncy  Spread  Duration  Adj_EUR  Adj_USD  left
6   CHF     NaN       1.0     50.0     10.0     0
7   CHF     NaN       2.0     60.0     20.0     0
1   CHF   200.0       2.5      NaN      NaN     1
8   CHF     NaN       5.0     70.0     30.0     0
2   EUR     NaN       1.0     10.0     20.0     0
0   EUR   100.0       1.2      NaN      NaN     1
3   EUR     NaN       2.0     20.0     30.0     0
4   EUR     NaN       5.0     30.0     40.0     0
5   EUR     NaN       7.0     40.0     50.0     0

Then set Duration as the index, interpolate the NaN values within each Crncy group against that index, and keep only the rows that came from df1:

df3 = df3.set_index('Duration')
df4 = df3.groupby(['Crncy']).apply(lambda x: x.interpolate(method='index')).reset_index()
df4 = df4[['Crncy', 'Spread', 'Duration', 'Adj_EUR', 'Adj_USD', 'left']]
df4 = df4.loc[df4['left'] == 1].drop('left', axis=1).reset_index(drop=True)

>>> df4
  Crncy  Spread  Duration    Adj_EUR    Adj_USD
0   CHF   200.0       2.5  61.666667  21.666667
1   EUR   100.0       1.2  12.000000  22.000000
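
The same values can probably be reached without concatenating the two frames. What follows is a minimal sketch, not part of the original answer, that assumes NumPy's np.interp is acceptable for the piecewise-linear step. The sample frames are rebuilt from the tables in the question, and the names result and adj_cols are illustrative. Note that np.interp holds the end values flat outside the TENOR range instead of extrapolating, and that rows without a matching currency stay NaN here (the question's expected output shows 0.0 there, which a final fillna(0.0) would produce).

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's tables (illustrative subset)
df_original = pd.DataFrame({
    "Crncy": ["EUR", np.nan, "CHF", "USD"],
    "Spread": [100, np.nan, 200, 50],
    "Duration": [1.2, np.nan, 2.5, 5.0],
})
df_interpolation = pd.DataFrame({
    "CRNCY": ["EUR"] * 4 + ["CHF"] * 3,
    "TENOR": [1, 2, 5, 7, 1, 2, 5],
    "Adj_EUR": [10, 20, 30, 40, 50, 60, 70],
    "Adj_USD": [20, 30, 40, 50, 10, 20, 30],
})

adj_cols = ["Adj_EUR", "Adj_USD"]
result = df_original.copy()
for col in adj_cols:
    result[col] = np.nan  # rows without a matching curve stay NaN

for crncy, curve in df_interpolation.groupby("CRNCY"):
    curve = curve.sort_values("TENOR")  # np.interp needs increasing x values
    mask = result["Crncy"] == crncy     # rows of this currency
    if not mask.any():
        continue
    durations = result.loc[mask, "Duration"].to_numpy(dtype=float)
    for col in adj_cols:
        result.loc[mask, col] = np.interp(
            durations,
            curve["TENOR"].to_numpy(dtype=float),
            curve[col].to_numpy(dtype=float),
        )

print(result)
```

With this layout the original row order and index of df_original are preserved, so no final filtering or re-sorting step is needed.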

Hope this helps.




Answer 2:


So, this is more what I was looking for:

import pandas as pd
from scipy import optimize

temp_interpolation_lst = []  # collects one frame per currency

for curr, df in df_original.groupby('Crncy'):

    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    curve = df_interpolation[df_interpolation['CRNCY'] == curr]
    x_data = curve['TENOR'].to_numpy(dtype=float)
    y_data_usd = curve['Adj_USD'].to_numpy(dtype=float)
    y_data_eur = curve['Adj_EUR'].to_numpy(dtype=float)

    # Skip currencies with too few points for a two-parameter fit;
    # their rows end up as NaN after the left join below
    if x_data.size < 2:
        continue

    # Linear fit
    z_linear_usd = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data_usd)[0]
    z_linear_eur = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data_eur)[0]

    temp_df = df[['Crncy', 'Duration']].copy()
    temp_df['Adj_USD'] = z_linear_usd[0] + z_linear_usd[1] * temp_df['Duration']
    temp_df['Adj_EUR'] = z_linear_eur[0] + z_linear_eur[1] * temp_df['Duration']

    temp_interpolation_lst.append(temp_df)

temp_interpolation_df = pd.concat(temp_interpolation_lst)
temp_interpolation_df.sort_index(axis=0, inplace=True)

# Add back to the original DataFrame; the indices match, so the join aligns row by row
df_original = df_original.join(other=temp_interpolation_df[['Adj_USD', 'Adj_EUR']], how='left')

It is not as clean as I had hoped, but it seems to work.
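
One point worth keeping in mind: a least-squares line through all tenor points is generally not the same thing as piecewise-linear interpolation between neighbouring tenors, so this approach and the interpolation in Answer 1 can give different numbers. The sketch below illustrates the gap on the EUR Adj_EUR points from the question. It swaps in np.polyfit with degree 1 for optimize.curve_fit, on the assumption that both produce the same least-squares line for the model a + b * t.

```python
import numpy as np

# EUR curve from the question: TENOR -> Adj_EUR
tenor = np.array([1.0, 2.0, 5.0, 7.0])
adj_eur = np.array([10.0, 20.0, 30.0, 40.0])
duration = 1.2

# Least-squares straight line through all four points
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(tenor, adj_eur, 1)
fitted = intercept + slope * duration

# Piecewise-linear interpolation between the two neighbouring tenors
interpolated = np.interp(duration, tenor, adj_eur)

print(fitted, interpolated)  # roughly 13.23 versus 12
```

Which of the two is appropriate depends on whether the adjustment curve is meant to pass exactly through the quoted tenors or only to follow their overall trend.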



Source: https://stackoverflow.com/questions/51838355/applying-interpolation-on-dataframe-based-on-another-dataframe
