Question
I have the following data, loaded with pandas from CSV files:
df1=pd.read_csv(path,parse_dates=True)
Printing df1 gives:
control Avg_return
2019-09-07 True 0
2019-06-06 True 0
2019-02-19 True 0
2019-01-17 True 0
2018-12-20 True 0
2018-11-27 True 0
2018-10-12 True 0
... ... ...
Then I load the second CSV file:
df2=pd.read_csv(path,parse_dates=True)
Printing df2 gives:
return
2010-01-01 NaN
2010-04-01 0.010920
2010-05-01 -0.004404
2010-06-01 -0.025209
2010-07-01 -0.023280
... ...
The aim of my code is to:
- Take a date from df1.
- Subtract 6 days from that date.
- Subtract 244 days from that same date.
- Take all the returns in df2 between these two dates.
- Compute the mean of these returns and store it in Avg_return.
I did this:
for i in range(0,df1_row):
    #I go through my data df1
    if (control.iloc[i]==True):
        #I check if control_1 is true
        date_1=df1.index[i]-pd.to_timedelta(6, unit='d')
        # I remove 6 days from my date
        date_2=df1.index[i]-pd.to_timedelta(244, unit='d')
        # I remove 244 days from my date
        df1.loc[i,"Average_return"] = df2[[date_1:date_2],["return"]].mean()
        # I want to make the mean of the return between my date-6 days and my date-244 days
Unfortunately, it gives me this error:
df1.loc[i,"Average_return"] = df2[[date1:date2],["return"]].mean()
^
SyntaxError: invalid syntax
Is someone able to help me? :)
Answer 1:
The following looks a bit ugly, but I think it works :)
Dummy df's:
import numpy as np
import pandas as pd
cols = ['date', 'control', 'Avg_return']
data = [
    [pd.to_datetime('2019-09-07'), True, 0],
    [pd.to_datetime('2019-06-06'), True, 0]
]
df1 = pd.DataFrame(data, columns=cols)
cols2 = ['date', 'return']
data2 = [
    [pd.to_datetime('2010-01-01'), np.nan],
    [pd.to_datetime('2010-04-01'), 0.010920],
    [pd.to_datetime('2019-09-01'), 1]
]
df2 = pd.DataFrame(data2, columns=cols2)
Drafted solution:
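df1['date_minus_6'] = df1['date'] - pd.Timedelta(days=6)
df1['date_minus_244'] = df1['date'] - pd.Timedelta(days=244)
# Store the result as float so that fractional returns fit the column
df1['Avg_return'] = df1['Avg_return'].astype(float)
for i in range(0, df1.shape[0]):
    for j in range(0, df2.shape[0]):
        # Note: only rows of df2 dated exactly on one of the two endpoints match
        if df2['date'].iloc[j] == df1['date_minus_6'].iloc[i]:
            # Assign through .loc to avoid chained assignment
            df1.loc[i, 'Avg_return'] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()
        elif df2['date'].iloc[j] == df1['date_minus_244'].iloc[i]:
            df1.loc[i, 'Avg_return'] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()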
Output:
date control Avg_return date_minus_6 date_minus_244
0 2019-09-07 True 1.0 2019-09-01 2019-01-06
1 2019-06-06 True 0.0 2019-05-31 2018-10-05
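Note that the loop above only picks up rows of df2 whose date falls exactly on one of the two endpoints. If the goal is the mean of every return between the two dates, a boolean mask over the date column is one way to sketch it. The helper name window_mean and the sample values below are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data in the same shape as the dummy frames above
df1 = pd.DataFrame({
    'date': pd.to_datetime(['2019-09-07', '2019-06-06']),
    'control': [True, True],
    'Avg_return': [0.0, 0.0],
})
df2 = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-10', '2019-05-01', '2019-09-01', '2019-09-05']),
    'return': [0.02, np.nan, 0.04, 0.50],
})


def window_mean(date, returns, near=6, far=244):
    """Mean of returns dated between `far` and `near` days before `date`, inclusive."""
    start = date - pd.Timedelta(days=far)
    end = date - pd.Timedelta(days=near)
    in_window = (returns['date'] >= start) & (returns['date'] <= end)
    # Series.mean() skips NaN values by default
    return returns.loc[in_window, 'return'].mean()


# Only fill rows where control is True; other rows keep their existing value
for i in df1[df1['control']].index:
    df1.loc[i, 'Avg_return'] = window_mean(df1.loc[i, 'date'], df2)

print(df1)
```

Keeping the window logic in one small function makes it easy to change the 6 and 244 day offsets, or to switch to strict inequalities if the endpoints should be excluded.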
Answer 2:
import pandas as pd
df1 = pd.read_csv('dsf1.csv', parse_dates=True)
df2 = pd.read_csv('dsf2.csv', parse_dates=True)
df1.columns = ['date', 'control', 'return']
df2.columns = ['date', 'return']
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
for i in range(0, df1.shape[0]):
    if df1['control'][i]:
        # The window runs from 244 days before the row's date to 6 days before it
        date_1 = df1['date'][i] - pd.to_timedelta(6, unit='D')
        date_2 = df1['date'][i] - pd.to_timedelta(244, unit='D')
        # I'm not sure this is exactly the condition you want, so adjust as you see fit
        in_window = (df2['date'] >= date_2) & (df2['date'] <= date_1)
        df1.loc[i, 'average_return'] = df2.loc[in_window, 'return'].mean()
print(df1)
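Answer 3:
This is a different approach that avoids an explicit loop over the rows:
# make sure both indexes are datetime indexes, and that df2 is sorted by date
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df2 = df2.sort_index()
df1['date_1'] = df1.index - pd.to_timedelta(6, unit='D')
df1['date_2'] = df1.index - pd.to_timedelta(244, unit='D')
# slice from the earlier date (date_2) to the later one (date_1); .loc includes both endpoints
df1['Average_return'] = df1.apply(lambda r: df2.loc[r['date_2']: r['date_1'], 'return'].mean(), axis=1)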
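As a self-contained check of the slicing idea, here is a minimal sketch with made-up sample data (the values are assumptions, not the asker's files). It also suggests why the line in the question failed: a slice such as date_2:date_1 is only valid directly inside the square brackets of an indexer, so wrapping it in a list literal, as in [date_1:date_2], is a SyntaxError. Label slices on a sorted DatetimeIndex include both endpoints.

```python
import numpy as np
import pandas as pd

# Hypothetical returns, indexed by date and sorted in ascending order
df2 = pd.DataFrame(
    {'return': [np.nan, 0.01, 0.03, 0.20]},
    index=pd.to_datetime(['2019-01-06', '2019-03-01', '2019-09-01', '2019-09-02']),
).sort_index()

date = pd.Timestamp('2019-09-07')
date_1 = date - pd.Timedelta(days=6)    # 2019-09-01
date_2 = date - pd.Timedelta(days=244)  # 2019-01-06

# The slice goes straight inside .loc[...]: earlier date first, later date second
window = df2.loc[date_2:date_1, 'return']
avg = window.mean()  # NaN values are skipped by default

print(len(window), avg)
```

Here the window holds three rows (one of them NaN) and the row dated 2019-09-02 is left out, so the mean is taken over the two remaining values.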
Source: https://stackoverflow.com/questions/58975665/problem-in-the-selection-of-a-part-of-data-in-pandas