Question
I have a data frame as shown below.
Doctor Appointment Booking_ID
A 2020-01-18 12:00:00 1
A 2020-01-18 12:30:00 2
A 2020-01-18 13:00:00 3
A 2020-01-18 13:00:00 4
A 2020-01-19 13:00:00 13
A 2020-01-19 13:30:00 14
B 2020-01-18 12:00:00 5
B 2020-01-18 12:30:00 6
B 2020-01-18 13:00:00 7
B 2020-01-25 12:30:00 6
B 2020-01-25 13:00:00 7
C 2020-01-19 12:00:00 19
C 2020-01-19 12:30:00 20
C 2020-01-19 13:00:00 21
C 2020-01-22 12:30:00 20
C 2020-01-22 13:00:00 21
From the above I would like to create a column called Session as shown below.
Expected Output:
Doctor Appointment Booking_ID Session
A 2020-01-18 12:00:00 1 S1
A 2020-01-18 12:30:00 2 S1
A 2020-01-18 13:00:00 3 S1
A 2020-01-18 13:00:00 4 S1
A 2020-01-29 13:00:00 13 S2
A 2020-01-29 13:30:00 14 S2
B 2020-01-18 12:00:00 5 S3
B 2020-01-18 12:30:00 6 S3
B 2020-01-18 13:00:00 17 S3
B 2020-01-25 12:30:00 16 S4
B 2020-01-25 13:00:00 7 S4
C 2020-01-19 12:00:00 19 S5
C 2020-01-19 12:30:00 20 S5
C 2020-01-19 13:00:00 21 S5
C 2020-01-22 12:30:00 29 S6
C 2020-01-22 13:00:00 26 S6
C 2020-01-22 13:30:00 24 S6
The Session should be different for each doctor and for each Appointment date (at day level).
I tried the following:
df = df.sort_values(['Doctor', 'Appointment'], ascending=True)
df['Appointment'] = pd.to_datetime(df['Appointment'])
dates = df['Appointment'].dt.date
df['Session'] = 'S' + pd.Series(dates.factorize()[0] + 1, index=df.index).astype(str)
But this assigns sessions based on dates only. I would like to take the doctor into account as well.
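Answer 1:
You can use sort_values and then flag every row where either the date differs from the previous row (the diff in date is not 0) or the doctor is not the same as in the previous row (checked with shift). The cumulative sum of those flags gives the session number, like: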
df = df.sort_values(['Doctor', 'Appointment'], ascending=True)
# True where the date or the doctor changes compared with the previous row
df['Session'] = 'S' + (df['Appointment'].dt.date.diff().ne(pd.Timedelta(days=0))
                       | df['Doctor'].ne(df['Doctor'].shift())).cumsum().astype(str)
print(df)
Doctor Appointment Booking_ID Session
0 A 2020-01-18 12:00:00 1 S1
1 A 2020-01-18 12:30:00 2 S1
2 A 2020-01-18 13:00:00 3 S1
3 A 2020-01-18 13:00:00 4 S1
4 A 2020-01-19 13:00:00 13 S2
5 A 2020-01-19 13:30:00 14 S2
6 B 2020-01-18 12:00:00 5 S3
7 B 2020-01-18 12:30:00 6 S3
8 B 2020-01-18 13:00:00 7 S3
9 B 2020-01-25 12:30:00 6 S4
10 B 2020-01-25 13:00:00 7 S4
11 C 2020-01-19 12:00:00 19 S5
12 C 2020-01-19 12:30:00 20 S5
13 C 2020-01-19 13:00:00 21 S5
14 C 2020-01-22 12:30:00 20 S6
15 C 2020-01-22 13:00:00 21 S6
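A self-contained sketch of the same shift/cumsum idea, in case it helps to run it end to end. The sample rows are copied from the question; the names rows, day and new_session are only illustrative, and dt.normalize() is swapped in for dt.date here (an assumption on my part, not what the answer uses) so that the diff stays a proper timedelta column.

```python
import pandas as pd

# Sample data copied from the question
rows = [
    ('A', '2020-01-18 12:00:00', 1), ('A', '2020-01-18 12:30:00', 2),
    ('A', '2020-01-18 13:00:00', 3), ('A', '2020-01-18 13:00:00', 4),
    ('A', '2020-01-19 13:00:00', 13), ('A', '2020-01-19 13:30:00', 14),
    ('B', '2020-01-18 12:00:00', 5), ('B', '2020-01-18 12:30:00', 6),
    ('B', '2020-01-18 13:00:00', 7), ('B', '2020-01-25 12:30:00', 6),
    ('B', '2020-01-25 13:00:00', 7), ('C', '2020-01-19 12:00:00', 19),
    ('C', '2020-01-19 12:30:00', 20), ('C', '2020-01-19 13:00:00', 21),
    ('C', '2020-01-22 12:30:00', 20), ('C', '2020-01-22 13:00:00', 21),
]
df = pd.DataFrame(rows, columns=['Doctor', 'Appointment', 'Booking_ID'])
df['Appointment'] = pd.to_datetime(df['Appointment'])

# The row-to-row comparison only makes sense on sorted data
df = df.sort_values(['Doctor', 'Appointment'])

# Midnight of each appointment day, so the diff is zero within the same day
day = df['Appointment'].dt.normalize()

# A new session starts where the day or the doctor changes
# (the first row compares against a missing value, so it is flagged too)
new_session = day.diff().ne(pd.Timedelta(0)) | df['Doctor'].ne(df['Doctor'].shift())

# Running count of session starts gives 1, 1, ..., 2, 2, ...
df['Session'] = 'S' + new_session.cumsum().astype(str)
print(df)
```

Note that this relies on the rows being sorted by Doctor and Appointment first; on unsorted data the same doctor/day pair could be split into several sessions.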
Answer 2:
IIUC, use GroupBy.ngroup with Series.dt.date:
df['Session'] = 'S' + (df.groupby(['Doctor', pd.to_datetime(df['Appointment']).dt.date])
                         .ngroup()
                         .add(1).astype(str))
Doctor Appointment Booking_ID Session
0 A 2020-01-18-12:00:00 1 S1
1 A 2020-01-18-12:30:00 2 S1
2 A 2020-01-18-13:00:00 3 S1
3 A 2020-01-18-13:00:00 4 S1
4 A 2020-01-19-13:00:00 13 S2
5 A 2020-01-19-13:30:00 14 S2
6 B 2020-01-18-12:00:00 5 S3
7 B 2020-01-18-12:30:00 6 S3
8 B 2020-01-18-13:00:00 7 S3
9 B 2020-01-25-12:30:00 6 S4
10 B 2020-01-25-13:00:00 7 S4
11 C 2020-01-19-12:00:00 19 S5
12 C 2020-01-19-12:30:00 20 S5
13 C 2020-01-19-13:00:00 21 S5
14 C 2020-01-22-12:30:00 20 S6
15 C 2020-01-22-13:00:00 21 S6
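One thing worth seeing with ngroup is that it numbers groups in sorted key order, so the rows do not need to be sorted beforehand. Below is a small sketch with the question's sample data deliberately reversed; the names rows and day are just for illustration.

```python
import pandas as pd

# Sample data copied from the question
rows = [
    ('A', '2020-01-18 12:00:00', 1), ('A', '2020-01-18 12:30:00', 2),
    ('A', '2020-01-18 13:00:00', 3), ('A', '2020-01-18 13:00:00', 4),
    ('A', '2020-01-19 13:00:00', 13), ('A', '2020-01-19 13:30:00', 14),
    ('B', '2020-01-18 12:00:00', 5), ('B', '2020-01-18 12:30:00', 6),
    ('B', '2020-01-18 13:00:00', 7), ('B', '2020-01-25 12:30:00', 6),
    ('B', '2020-01-25 13:00:00', 7), ('C', '2020-01-19 12:00:00', 19),
    ('C', '2020-01-19 12:30:00', 20), ('C', '2020-01-19 13:00:00', 21),
    ('C', '2020-01-22 12:30:00', 20), ('C', '2020-01-22 13:00:00', 21),
]
df = pd.DataFrame(rows, columns=['Doctor', 'Appointment', 'Booking_ID'])

# Reverse the rows on purpose: ngroup should not care about row order
df = df.iloc[::-1]

# Calendar date of each appointment, used as the second grouping key
day = pd.to_datetime(df['Appointment']).dt.date

# ngroup is 0-based, hence the add(1)
df['Session'] = 'S' + df.groupby(['Doctor', day]).ngroup().add(1).astype(str)

# Restore the original row order for display
df = df.sort_index()
print(df)
```

Even though the frame was reversed before grouping, doctor A on 2020-01-18 still ends up as S1, because the numbering follows the sorted (Doctor, date) keys.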
Answer 3:
IIUC, this is groupby().ngroup():
# convert to datetime
df.Appointment = pd.to_datetime(df.Appointment)
df['Session'] = 'S' + (df.groupby(['Doctor', df.Appointment.dt.date]).ngroup()+1).astype(str)
Output:
Doctor Appointment Booking_ID Session
0 A 2020-01-18 12:00:00 1 S1
1 A 2020-01-18 12:30:00 2 S1
2 A 2020-01-18 13:00:00 3 S1
3 A 2020-01-18 13:00:00 4 S1
4 A 2020-01-19 13:00:00 13 S2
5 A 2020-01-19 13:30:00 14 S2
6 B 2020-01-18 12:00:00 5 S3
7 B 2020-01-18 12:30:00 6 S3
8 B 2020-01-18 13:00:00 7 S3
9 B 2020-01-25 12:30:00 6 S4
10 B 2020-01-25 13:00:00 7 S4
11 C 2020-01-19 12:00:00 19 S5
12 C 2020-01-19 12:30:00 20 S5
13 C 2020-01-19 13:00:00 21 S5
14 C 2020-01-22 12:30:00 20 S6
15 C 2020-01-22 13:00:00 21 S6
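For comparison, the factorize attempt from the question can probably be made doctor-aware as well, by factorizing a combined doctor-plus-date key instead of the date alone. This is only a sketch: the combined string key and the names rows and key are my own illustration, not something from the answers above.

```python
import pandas as pd

# Sample data copied from the question
rows = [
    ('A', '2020-01-18 12:00:00', 1), ('A', '2020-01-18 12:30:00', 2),
    ('A', '2020-01-18 13:00:00', 3), ('A', '2020-01-18 13:00:00', 4),
    ('A', '2020-01-19 13:00:00', 13), ('A', '2020-01-19 13:30:00', 14),
    ('B', '2020-01-18 12:00:00', 5), ('B', '2020-01-18 12:30:00', 6),
    ('B', '2020-01-18 13:00:00', 7), ('B', '2020-01-25 12:30:00', 6),
    ('B', '2020-01-25 13:00:00', 7), ('C', '2020-01-19 12:00:00', 19),
    ('C', '2020-01-19 12:30:00', 20), ('C', '2020-01-19 13:00:00', 21),
    ('C', '2020-01-22 12:30:00', 20), ('C', '2020-01-22 13:00:00', 21),
]
df = pd.DataFrame(rows, columns=['Doctor', 'Appointment', 'Booking_ID'])
df['Appointment'] = pd.to_datetime(df['Appointment'])
df = df.sort_values(['Doctor', 'Appointment'])

# Combined key such as 'A_2020-01-18' (hypothetical helper, for illustration)
key = df['Doctor'] + '_' + df['Appointment'].dt.strftime('%Y-%m-%d')

# factorize returns 0-based codes in order of first appearance, hence the + 1
codes = key.factorize()[0] + 1
df['Session'] = 'S' + pd.Series(codes, index=df.index).astype(str)
print(df)
```

The difference from ngroup is the numbering order: factorize numbers keys in the order they first appear, while ngroup follows the sorted group keys, so the two only agree when the frame is sorted by Doctor and Appointment as above.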
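Answer 4:
Another approach using idxmin, with a slightly different result: the labels are derived from row index values within each group rather than from consecutive group numbers.
df['Session'] = 'S' + (df.groupby(['Doctor', df.Appointment.dt.date])
                         .transform('idxmin')
                         .iloc[:, 0] + 1).astype('str')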
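To make the "slightly different result" concrete, here is a sketch on the question's sample data, assuming a default 0..n-1 index. It selects the Appointment column explicitly instead of using iloc[:, 0], and the names rows, day and first_row are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
rows = [
    ('A', '2020-01-18 12:00:00', 1), ('A', '2020-01-18 12:30:00', 2),
    ('A', '2020-01-18 13:00:00', 3), ('A', '2020-01-18 13:00:00', 4),
    ('A', '2020-01-19 13:00:00', 13), ('A', '2020-01-19 13:30:00', 14),
    ('B', '2020-01-18 12:00:00', 5), ('B', '2020-01-18 12:30:00', 6),
    ('B', '2020-01-18 13:00:00', 7), ('B', '2020-01-25 12:30:00', 6),
    ('B', '2020-01-25 13:00:00', 7), ('C', '2020-01-19 12:00:00', 19),
    ('C', '2020-01-19 12:30:00', 20), ('C', '2020-01-19 13:00:00', 21),
    ('C', '2020-01-22 12:30:00', 20), ('C', '2020-01-22 13:00:00', 21),
]
df = pd.DataFrame(rows, columns=['Doctor', 'Appointment', 'Booking_ID'])
df['Appointment'] = pd.to_datetime(df['Appointment'])

# Calendar date as a separately named grouping key
day = df['Appointment'].dt.date.rename('Day')

# For every row, the index label of the earliest appointment in its group
first_row = df.groupby(['Doctor', day])['Appointment'].transform('idxmin')

# Labels are index-based, so they jump instead of counting 1, 2, 3, ...
df['Session'] = 'S' + (first_row + 1).astype(str)
print(df['Session'].unique())
```

Each doctor/day group still gets its own label, but the labels are not consecutive (here they follow the index of the first row of each group), and they depend on the index being unique and numeric.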
Source: https://stackoverflow.com/questions/61463502/create-a-new-column-based-on-groupby-date-time-column-at-date-level-in-pandas