Question
I have a dataframe as given below:
import numpy as np
import pandas as pd
data_file = pd.DataFrame({'person_id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3], 'ob.date': [np.nan] * 10,
    'observation': ['Age', 'interviewdate', 'marital_status', 'interviewdate', 'Age', 'interviewdate', 'marital_status', 'Age', 'interviewdate', 'marital_status'],
    'answer': [21, '21/08/2017', 'Single', '22/05/2217', 26, '11/03/2010', 'Single', 41, '31/09/2012', 'Married']
})
What I would like to do is fetch the date values from the answer column and put them in the ob.date column. The dataframe shows that person_id = 1 answered the question about Age on 21/08/2017 and the question about marital_status on 22/05/2017 (entered as 22/05/2217 in the sample data).
This is what I tried, based on a suggestion from another SO post:
s = data_file[(data_file.observation == 'interviewdate')].set_index('person_id')['answer']
data_file['ob.date'] = data_file['person_id'].map(s)
But this does not work because I get a duplicate index error. How can I avoid that issue and make it efficient enough?
Any elegant and efficient solution would be helpful. person_id = 1 has two date values, so every row above an interviewdate observation should be filled with the value from the answer column of that interviewdate observation.
How can I get the output described above?
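The duplicate index error can be reproduced in isolation. The sketch below is a minimal illustration and not part of the original post: it rebuilds the lookup Series from the question and shows that its index repeats person_id 1, which is what Series.map rejects. The names index_is_unique, duplicated_ids and map_failed are introduced here for illustration only.

```python
import pandas as pd

# Sample data from the question, without the empty ob.date column.
data_file = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'observation': ['Age', 'interviewdate', 'marital_status', 'interviewdate',
                    'Age', 'interviewdate', 'marital_status',
                    'Age', 'interviewdate', 'marital_status'],
    'answer': [21, '21/08/2017', 'Single', '22/05/2217', 26, '11/03/2010',
               'Single', 41, '31/09/2012', 'Married'],
})

# Lookup Series as built in the question: person_id -> interview date.
s = data_file[data_file.observation == 'interviewdate'].set_index('person_id')['answer']

# person_id 1 has two interviewdate rows, so the lookup index is not unique.
index_is_unique = s.index.is_unique
duplicated_ids = s.index[s.index.duplicated()].tolist()

# map needs exactly one value per key, so the repeated key makes it fail.
try:
    data_file['person_id'].map(s)
    map_failed = False
except Exception as exc:
    map_failed = True
    print(type(exc).__name__, exc)

print(index_is_unique, duplicated_ids, map_failed)
```

Because one person can have several interview dates, a lookup keyed on person_id alone cannot say which date belongs to which row. That is why the answer fills values by position within each group instead.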
Answer 1:
It depends on the data. First fill the new column from answer on the rows where observation is interviewdate, then replace the missing values within each person_id group by back filling followed by forward filling:
data_file['ob.date'] = data_file.loc[data_file.observation == 'interviewdate', 'answer']
# transform returns a result aligned to the original index, so it can be assigned back directly
data_file['ob.date'] = (data_file.groupby('person_id')['ob.date']
                                 .transform(lambda x: x.bfill().ffill()))
print(data_file)
   person_id     ob.date     observation      answer
0          1  21/08/2017             Age          21
1          1  21/08/2017   interviewdate  21/08/2017
2          1  22/05/2217  marital_status      Single
3          1  22/05/2217   interviewdate  22/05/2217
4          2  11/03/2010             Age          26
5          2  11/03/2010   interviewdate  11/03/2010
6          2  11/03/2010  marital_status      Single
7          3  31/09/2012             Age          41
8          3  31/09/2012   interviewdate  31/09/2012
9          3  31/09/2012  marital_status     Married
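Details:
Back filling is applied per group first because each interviewdate row closes its subgroup: every row above it, back to the previous interviewdate row, belongs to the same interview. Forward filling is then added to replace the trailing NaNs in each group, which bfill leaves unfilled, as the bfill-only result shows: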
# back fill only, to show which rows are still missing afterwards
data_file['ob.date'] = (data_file.groupby('person_id')['ob.date']
                                 .transform(lambda x: x.bfill()))
print(data_file)
   person_id     ob.date     observation      answer
0          1  21/08/2017             Age          21
1          1  21/08/2017   interviewdate  21/08/2017
2          1  22/05/2217  marital_status      Single
3          1  22/05/2217   interviewdate  22/05/2217
4          2  11/03/2010             Age          26
5          2  11/03/2010   interviewdate  11/03/2010
6          2         NaN  marital_status      Single
7          3  31/09/2012             Age          41
8          3  31/09/2012   interviewdate  31/09/2012
9          3         NaN  marital_status     Married
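As a variation on the answer, the same fill can be written with the built-in groupby bfill and ffill methods instead of a lambda. This is a self-contained sketch, assuming the rows of each person are already in interview order as in the sample data. The is_date name is introduced here for illustration.

```python
import numpy as np
import pandas as pd

data_file = pd.DataFrame({
    'person_id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'ob.date': [np.nan] * 10,
    'observation': ['Age', 'interviewdate', 'marital_status', 'interviewdate',
                    'Age', 'interviewdate', 'marital_status',
                    'Age', 'interviewdate', 'marital_status'],
    'answer': [21, '21/08/2017', 'Single', '22/05/2217', 26, '11/03/2010',
               'Single', 41, '31/09/2012', 'Married'],
})

# Copy answer into ob.date only on interviewdate rows; other rows stay NaN.
is_date = data_file['observation'] == 'interviewdate'
data_file['ob.date'] = data_file['answer'].where(is_date)

# Back fill within each person so rows above an interviewdate row get its date.
data_file['ob.date'] = data_file.groupby('person_id')['ob.date'].bfill()

# Forward fill within each person to cover rows after the last interviewdate row.
data_file['ob.date'] = data_file.groupby('person_id')['ob.date'].ffill()

print(data_file)
```

The dates are kept as strings here. The sample contains '31/09/2012', which is not a valid calendar date, so converting the column with pd.to_datetime would need explicit error handling.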
Source: https://stackoverflow.com/questions/57475086/elegant-way-to-fill-in-a-column-with-row-values-based-on-groups-in-pandas