Question
I have a dataframe like the one given below (edited dataframe):
import pandas as pd
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00','2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
What I would like to do is create a new record. For subject_id = 1, the record for the 4th day is missing, so I am trying to copy the immediately preceding row.
I tried the following, but it didn't help:
df.groupby('subject_id')['day'].eq(df['day'].shift(-1)).add(1)
The new record should have the same content as the previous row, with only the date value modified (d+1). I expect the output for each subject_id to include a new record for day 4. Please note that the time component of a new row doesn't really matter; it can be anything (00:00:00).
I only wish to add the missing dates within a range in a month. For example, subject_id = 1 has records from the 3rd to the 5th in the 4th month, but the 4th is missing, so we add a record for the 4th day alone. We don't need the 6th, 7th, etc.
Answer 1:
There are duplicated dates once the times are removed, so you can create a helper DataFrame with all dates per subject_id:
# 'date' holds time_1 with the time component removed
df['date'] = df['time_1'].dt.normalize()
df1 = (df.set_index('date')
         .groupby('subject_id')
         .resample('D')
         .last()
         .index
         .to_frame(index=False))
print (df1)
    subject_id       date
0            1 2173-04-03
1            1 2173-04-04
2            1 2173-04-05
3            1 2173-04-06
4            2 2173-04-08
5            2 2173-04-09
6            2 2173-04-10
7            2 2173-04-11
8            2 2173-04-12
9            2 2173-04-13
10           2 2173-04-14
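The helper frame above amounts to one continuous daily calendar per subject. As a minimal sketch of the same idea without resample, the snippet below builds that calendar explicitly with pd.date_range. The toy frame and the names `toy`, `parts`, and `calendar` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data: subject 1 has no row for 2173-04-04.
toy = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "date": pd.to_datetime(["2173-04-03", "2173-04-05", "2173-04-08"]),
})

parts = []
for sid, g in toy.groupby("subject_id"):
    # Every calendar day between this subject's first and last date.
    days = pd.date_range(g["date"].min(), g["date"].max(), freq="D")
    parts.append(pd.DataFrame({"subject_id": sid, "date": days}))

# One row per (subject, day), including days with no original record.
calendar = pd.concat(parts, ignore_index=True)
print(calendar)
```

Grouping by subject and calendar month, instead of by subject alone, would presumably confine each range to a single month, which is the restriction the question describes.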
Then use DataFrame.merge with a left join and forward fill the missing values:
df2 = df1.merge(df, how='left').groupby('subject_id', as_index=False).ffill()
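To make the left join and per-subject forward fill concrete, here is a small sketch on hypothetical toy data. It also passes indicator=True, an optional merge parameter that the answer does not use, to flag which rows exist only in the calendar. The names `records`, `calendar`, and `filled` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical observed records: 2173-04-04 is missing.
records = pd.DataFrame({
    "subject_id": [1, 1],
    "date": pd.to_datetime(["2173-04-03", "2173-04-05"]),
    "val": [5, 7],
})

# Full daily calendar for the same subject.
calendar = pd.DataFrame({
    "subject_id": [1, 1, 1],
    "date": pd.to_datetime(["2173-04-03", "2173-04-04", "2173-04-05"]),
})

# Left join keeps every calendar day; days without a record get NaN in 'val'.
filled = calendar.merge(records, on=["subject_id", "date"],
                        how="left", indicator=True)

# Forward fill within each subject so values never leak across subjects.
filled["val"] = filled.groupby("subject_id")["val"].ffill()

# True for rows that were created by the join rather than observed.
filled["is_new"] = filled["_merge"] == "left_only"
print(filled)
```

Because the join introduces NaN before filling, `val` becomes a float column here, which is why the answer casts it back with astype(int) at the end.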
Finally, it is necessary to add days to the newly added datetimes. One possible solution is to add timedeltas created from the difference between the date values and the forward-filled time_1 values:
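import numpy as np
# midnight of each (possibly forward-filled) timestamp
dates = df2['time_1'].dt.normalize()
# shift forward-filled rows so that time_1 falls on the row's calendar date
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)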
print (df2)
         date              time_1  val  day
0  2173-04-03 2173-04-03 12:35:00    5    3
1  2173-04-03 2173-04-03 12:50:00    5    3
2  2173-04-03 2173-04-03 12:59:00    5    3
3  2173-04-04 2173-04-04 13:14:00    5    4
4  2173-04-04 2173-04-04 13:37:00    1    4
5  2173-04-05 2173-04-05 13:37:00    1    5
6  2173-04-06 2173-04-06 13:39:00    6    6
7  2173-04-06 2173-04-06 11:30:00    5    6
8  2173-04-08 2173-04-08 16:00:00    5    8
9  2173-04-09 2173-04-09 22:00:00    8    9
10 2173-04-10 2173-04-10 22:00:00    8   10
11 2173-04-11 2173-04-11 04:00:00    3   11
12 2173-04-12 2173-04-12 04:00:00    3   12
13 2173-04-13 2173-04-13 04:30:00    4   13
14 2173-04-14 2173-04-14 08:00:00    6   14
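The date adjustment in the last step can also be written without np.where, because the difference between date and the normalized time_1 is already zero on rows that were not forward-filled. A minimal sketch on a hypothetical two-row frame follows; the name `offset` is illustrative and the frame is not the answer's data.

```python
import pandas as pd

# Hypothetical state after merge + ffill: the second row is a copy of the
# first, so its time_1 still points at the previous day.
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2173-04-03", "2173-04-04"]),
    "time_1": pd.to_datetime(["2173-04-03 12:59:00", "2173-04-03 12:59:00"]),
})

# Whole-day gap between the calendar date and the timestamp's own date.
# It is zero for original rows and one or more days for copied rows.
offset = df2["date"] - df2["time_1"].dt.normalize()

# Move each timestamp onto its calendar date, keeping the time of day.
df2["time_1"] = df2["time_1"] + offset
df2["day"] = df2["time_1"].dt.day
print(df2)
```

This variant keeps the copied row's time of day, as the answer does. Since the question says the time component does not matter, setting time_1 to the date itself for new rows would be an equally acceptable choice.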
Source: https://stackoverflow.com/questions/57784410/how-to-create-a-new-row-on-the-fly-by-copying-previous-row