Create a new column based on a groupby of a datetime column at date level in pandas

Question

I have a DataFrame as shown below.

Doctor       Appointment           Booking_ID   
  A          2020-01-18 12:00:00     1 
  A          2020-01-18 12:30:00     2
  A          2020-01-18 13:00:00     3 
  A          2020-01-18 13:00:00     4
  A          2020-01-19 13:00:00     13
  A          2020-01-19 13:30:00     14 
  B          2020-01-18 12:00:00     5 
  B          2020-01-18 12:30:00     6 
  B          2020-01-18 13:00:00     7
  B          2020-01-25 12:30:00     6 
  B          2020-01-25 13:00:00     7
  C          2020-01-19 12:00:00     19 
  C          2020-01-19 12:30:00     20
  C          2020-01-19 13:00:00     21
  C          2020-01-22 12:30:00     20
  C          2020-01-22 13:00:00     21

From the above, I would like to create a column called Session, as shown below.

Expected Output:

Doctor       Appointment           Booking_ID   Session
  A          2020-01-18 12:00:00     1          S1
  A          2020-01-18 12:30:00     2          S1
  A          2020-01-18 13:00:00     3          S1
  A          2020-01-18 13:00:00     4          S1
  A          2020-01-29 13:00:00     13         S2
  A          2020-01-29 13:30:00     14         S2
  B          2020-01-18 12:00:00     5          S3
  B          2020-01-18 12:30:00     6          S3
  B          2020-01-18 13:00:00     17         S3
  B          2020-01-25 12:30:00     16         S4
  B          2020-01-25 13:00:00     7          S4
  C          2020-01-19 12:00:00     19         S5
  C          2020-01-19 12:30:00     20         S5
  C          2020-01-19 13:00:00     21         S5
  C          2020-01-22 12:30:00     29         S6
  C          2020-01-22 13:00:00     26         S6
  C          2020-01-22 13:30:00     24         S6

The Session should be different for each doctor and for each Appointment date (at day level).

I tried the following:

df = df.sort_values(['Doctor', 'Appointment'], ascending=True)


df['Appointment'] = pd.to_datetime(df['Appointment'])
dates = df['Appointment'].dt.date

df['Session'] = 'S' + pd.Series(dates.factorize()[0] + 1, index=df.index).astype(str)

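But this assigns sessions based on the dates only. I would like to take the doctor into account as well.

Answer 1:

You can use sort_values and then flag every row where either the date differs from the previous row (diff is not 0) or the doctor differs from the previous row (checked with shift). The cumulative sum of those flags gives the session number: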

df = df.sort_values(['Doctor', 'Appointment'], ascending=True)
df['Session'] = 'S'+(df['Appointment'].dt.date.diff().ne(pd.Timedelta(days=0))
                     |df['Doctor'].ne(df['Doctor'].shift())).cumsum().astype(str)
print(df)
   Doctor         Appointment  Booking_ID Session
0       A 2020-01-18 12:00:00           1      S1
1       A 2020-01-18 12:30:00           2      S1
2       A 2020-01-18 13:00:00           3      S1
3       A 2020-01-18 13:00:00           4      S1
4       A 2020-01-19 13:00:00          13      S2
5       A 2020-01-19 13:30:00          14      S2
6       B 2020-01-18 12:00:00           5      S3
7       B 2020-01-18 12:30:00           6      S3
8       B 2020-01-18 13:00:00           7      S3
9       B 2020-01-25 12:30:00           6      S4
10      B 2020-01-25 13:00:00           7      S4
11      C 2020-01-19 12:00:00          19      S5
12      C 2020-01-19 12:30:00          20      S5
13      C 2020-01-19 13:00:00          21      S5
14      C 2020-01-22 12:30:00          20      S6
15      C 2020-01-22 13:00:00          21      S6
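
To make the logic of this answer easier to follow, here is a minimal sketch that splits the condition into named steps. The small DataFrame and the names day, new_day and new_doctor are made up for illustration, and dt.normalize() is used in place of dt.date.diff(), which is an assumption of this sketch rather than the answerer's exact code.

```python
import pandas as pd

# Hypothetical toy data: doctor B's first appointment falls on the
# same day as doctor A's last one.
df = pd.DataFrame({
    "Doctor": ["A", "A", "A", "B", "B"],
    "Appointment": pd.to_datetime([
        "2020-01-18 12:00:00",
        "2020-01-18 12:30:00",
        "2020-01-19 13:00:00",
        "2020-01-19 12:00:00",
        "2020-01-25 12:30:00",
    ]),
})
df = df.sort_values(["Doctor", "Appointment"])

# Midnight of each appointment's day, kept as datetime64
day = df["Appointment"].dt.normalize()

# True where the day differs from the previous row (always True on the first row)
new_day = day.ne(day.shift())
# True where the doctor differs from the previous row
new_doctor = df["Doctor"].ne(df["Doctor"].shift())

# Each True starts a new session; the running total is the session number
df["Session"] = "S" + (new_day | new_doctor).cumsum().astype(str)
print(df)
```

The fourth row shows why the doctor check is needed: the date is the same as in the row before it, so only the change of doctor starts a new session.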

Answer 2:

If I understand correctly, you can use GroupBy.ngroup with Series.dt.date:

df['Session'] = 'S' + (df.groupby(['Doctor',pd.to_datetime(df['Appointment']).dt.date])
                         .ngroup()
                         .add(1).astype(str))

   Doctor          Appointment  Booking_ID Session
0       A  2020-01-18-12:00:00           1      S1
1       A  2020-01-18-12:30:00           2      S1
2       A  2020-01-18-13:00:00           3      S1
3       A  2020-01-18-13:00:00           4      S1
4       A  2020-01-19-13:00:00          13      S2
5       A  2020-01-19-13:30:00          14      S2
6       B  2020-01-18-12:00:00           5      S3
7       B  2020-01-18-12:30:00           6      S3
8       B  2020-01-18-13:00:00           7      S3
9       B  2020-01-25-12:30:00           6      S4
10      B  2020-01-25-13:00:00           7      S4
11      C  2020-01-19-12:00:00          19      S5
12      C  2020-01-19-12:30:00          20      S5
13      C  2020-01-19-13:00:00          21      S5
14      C  2020-01-22-12:30:00          20      S6
15      C  2020-01-22-13:00:00          21      S6

Answer 3:

If I understand correctly, this is a job for groupby().ngroup():

# convert to datetime
df.Appointment = pd.to_datetime(df.Appointment)

df['Session'] = 'S' + (df.groupby(['Doctor', df.Appointment.dt.date]).ngroup()+1).astype(str)

Output:

   Doctor         Appointment  Booking_ID Session
0       A 2020-01-18 12:00:00           1      S1
1       A 2020-01-18 12:30:00           2      S1
2       A 2020-01-18 13:00:00           3      S1
3       A 2020-01-18 13:00:00           4      S1
4       A 2020-01-19 13:00:00          13      S2
5       A 2020-01-19 13:30:00          14      S2
6       B 2020-01-18 12:00:00           5      S3
7       B 2020-01-18 12:30:00           6      S3
8       B 2020-01-18 13:00:00           7      S3
9       B 2020-01-25 12:30:00           6      S4
10      B 2020-01-25 13:00:00           7      S4
11      C 2020-01-19 12:00:00          19      S5
12      C 2020-01-19 12:30:00          20      S5
13      C 2020-01-19 13:00:00          21      S5
14      C 2020-01-22 12:30:00          20      S6
15      C 2020-01-22 13:00:00          21      S6
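
For anyone who wants to run this end to end, here is a self-contained sketch of the ngroup approach. The DataFrame is rebuilt by hand from the sample data in the question, and the variable name day is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Doctor": ["A"] * 6 + ["B"] * 5 + ["C"] * 5,
    "Appointment": pd.to_datetime([
        "2020-01-18 12:00:00", "2020-01-18 12:30:00",
        "2020-01-18 13:00:00", "2020-01-18 13:00:00",
        "2020-01-19 13:00:00", "2020-01-19 13:30:00",
        "2020-01-18 12:00:00", "2020-01-18 12:30:00",
        "2020-01-18 13:00:00", "2020-01-25 12:30:00",
        "2020-01-25 13:00:00", "2020-01-19 12:00:00",
        "2020-01-19 12:30:00", "2020-01-19 13:00:00",
        "2020-01-22 12:30:00", "2020-01-22 13:00:00",
    ]),
    "Booking_ID": [1, 2, 3, 4, 13, 14, 5, 6, 7, 6, 7, 19, 20, 21, 20, 21],
})

# Calendar day of each appointment, used as the second grouping key
day = df["Appointment"].dt.date

# ngroup() numbers the (Doctor, day) groups starting from 0, so add 1
df["Session"] = "S" + (df.groupby(["Doctor", day]).ngroup() + 1).astype(str)
print(df)
```

With the default sort=True, ngroup numbers the groups in sorted key order rather than in order of first appearance, so the frame does not need to be sorted beforehand.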

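Answer 4:

Another approach uses idxmin and gives a slightly different result: the labels come from row index values (the index of the earliest appointment in each group, plus 1) rather than from a running group counter, so they are generally not consecutive: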

df['Session'] = 'S' + (df.groupby(
    ['Doctor', df.Appointment.dt.date]
).transform('idxmin').iloc[:,0]+1).astype('str')
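
To show what "slightly different" means here, the sketch below builds labels from the first index label of each group. It is a variant, not the answerer's exact code: it takes the minimum of the index per group instead of calling transform('idxmin') on the frame, and it assumes a default RangeIndex on a frame already sorted by Doctor and Appointment. The toy data and the name first_idx are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, already sorted, with the default RangeIndex 0..4
df = pd.DataFrame({
    "Doctor": ["A", "A", "A", "B", "B"],
    "Appointment": pd.to_datetime([
        "2020-01-18 12:00:00",
        "2020-01-18 12:30:00",
        "2020-01-19 13:00:00",
        "2020-01-18 12:00:00",
        "2020-01-18 12:30:00",
    ]),
})

# Smallest index label within each (Doctor, day) group, broadcast to every row
first_idx = (
    df.index.to_series()
    .groupby([df["Doctor"], df["Appointment"].dt.date])
    .transform("min")
)

# Labels are still unique per group, but they skip numbers
df["Session"] = "S" + (first_idx + 1).astype(str)
print(df)
```

Every group still gets its own label, but the numbering has gaps (here S1, S3, S4), so this only fits when the labels need to be distinct rather than sequential.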

Source: https://stackoverflow.com/questions/61463502/create-a-new-column-based-on-groupby-date-time-column-at-date-level-in-pandas

Tags

pandas

pandas-groupby