Determining group size based on entry and exit times of IDs in my df

Question

This gives me the entry and exit times of IDs in my df:

minmax = merged_df.groupby(['id'])['date'].agg(['min', 'max'])

result

id   min                 max

4900 2019-09-17 08:43:06 2019-09-17 09:38:20
4909 2019-09-17 08:43:06 2019-09-17 09:16:00
4911 2019-09-17 08:43:06 2019-09-17 09:43:58
4965 2019-09-17 09:27:14 2019-09-17 09:38:28
5134 2019-09-17 09:34:26 2019-09-17 09:38:27
5139 2019-09-17 09:37:03 2019-09-17 09:46:19
5141 2019-09-17 09:37:22 2019-09-17 12:06:30
5163 2019-09-17 09:38:03 2019-09-17 10:18:29
5170 2019-09-17 09:38:19 2019-09-17 12:47:49
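
For readers who want to reproduce a table like the one above, here is a minimal sketch. The sample rows are a hypothetical subset in the same shape as the data, and the column names `entry` and `exit` are an assumption chosen for readability (the post itself uses `min` and `max`).

```python
import pandas as pd

# Hypothetical sample in the same shape as the DataFrame in the question.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-09-17 08:43:06", "2019-09-17 08:43:06",
        "2019-09-17 08:43:07", "2019-09-17 08:43:09",
    ]),
    "x": [206, 234, 231, 206],
    "y": [210, 236, 244, 210],
    "id": [4900, 4909, 4909, 4900],
})

# Named aggregation: the first sighting of an id is its entry time,
# the last sighting is its exit time.
minmax = df.groupby("id")["date"].agg(entry="min", exit="max")
print(minmax)
```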

This is what my DataFrame looks like:

df

date                    x   y   id  

2019-09-17 08:43:06     206 210 4900
2019-09-17 08:43:06     234 236 4909
2019-09-17 08:43:06     251 222 4911
2019-09-17 08:43:07     231 244 4909
2019-09-17 08:43:07     252 222 4911
2019-09-17 08:43:07     207 210 4965
2019-09-17 08:43:08     234 250 5163
2019-09-17 08:43:08     252 222 4911
2019-09-17 08:43:08     206 210 4900
2019-09-17 08:43:09     252 222 4911
2019-09-17 08:43:09     206 210 4900
2019-09-17 08:43:09     223 247 4909
2019-09-17 08:43:10     206 210 4900
2019-09-17 08:43:10     229 237 4909
2019-09-17 08:43:10     252 222 4911
2019-09-17 08:43:12     226 241 4909

How can I create a new column in my DataFrame that compares the entry times of the IDs and, if several IDs first appeared within the same time range (for example, the same minute), records the size of that group? I would like to get something like this:

df

date                    x   y   id      groupsize

2019-09-17 08:43:06     206 210 4900    3
2019-09-17 08:43:06     234 236 4909    3
2019-09-17 08:43:06     251 222 4911    3
2019-09-17 08:43:07     231 244 4909    3
2019-09-17 08:43:07     252 222 4911    3
2019-09-17 08:43:07     207 210 4965    1
2019-09-17 08:43:08     234 250 5134    1
2019-09-17 08:43:08     252 222 5139    2
2019-09-17 08:43:08     206 210 4900    3
2019-09-17 08:43:09     252 222 4911    3
2019-09-17 08:43:09     206 210 4900    3
2019-09-17 08:43:09     223 247 4909    3
2019-09-17 08:43:10     206 210 5141    2
2019-09-17 08:43:10     229 237 4909    3
2019-09-17 08:43:10     252 222 5163    2
2019-09-17 08:43:12     226 241 5170    2

How can I do this? Can anyone help me out?

I appreciate any hint!

Answer 1:

IIUC, first let's merge the min and max values onto your DataFrame:

import pandas as pd
import numpy as np
df3 = pd.merge(df,minmax,on='id',how='left')
                  date    x    y    id                 min                 max
0  2019-09-17 08:43:06  206  210  4900 2019-09-17 08:43:06 2019-09-17 09:38:20
1  2019-09-17 08:43:06  234  236  4909 2019-09-17 08:43:06 2019-09-17 09:16:00
2  2019-09-17 08:43:06  251  222  4911 2019-09-17 08:43:06 2019-09-17 09:43:58
3  2019-09-17 08:43:07  231  244  4909 2019-09-17 08:43:06 2019-09-17 09:16:00
4  2019-09-17 08:43:07  252  222  4911 2019-09-17 08:43:06 2019-09-17 09:43:58
5  2019-09-17 08:43:07  207  210  4965 2019-09-17 09:27:14 2019-09-17 09:38:28
6  2019-09-17 08:43:08  234  250  5163 2019-09-17 09:38:03 2019-09-17 10:18:29
7  2019-09-17 08:43:08  252  222  4911 2019-09-17 08:43:06 2019-09-17 09:43:58
8  2019-09-17 08:43:08  206  210  4900 2019-09-17 08:43:06 2019-09-17 09:38:20
9  2019-09-17 08:43:09  252  222  4911 2019-09-17 08:43:06 2019-09-17 09:43:58
10 2019-09-17 08:43:09  206  210  4900 2019-09-17 08:43:06 2019-09-17 09:38:20
11 2019-09-17 08:43:09  223  247  4909 2019-09-17 08:43:06 2019-09-17 09:16:00
12 2019-09-17 08:43:10  206  210  4900 2019-09-17 08:43:06 2019-09-17 09:38:20
13 2019-09-17 08:43:10  229  237  4909 2019-09-17 08:43:06 2019-09-17 09:16:00
14 2019-09-17 08:43:10  252  222  4911 2019-09-17 08:43:06 2019-09-17 09:43:58
15 2019-09-17 08:43:12  226  241  4909 2019-09-17 08:43:06 2019-09-17 09:16:00

Then let's work out the absolute difference in seconds between the date and the min value. If you need the signed value, drop the abs() and use the difference as is, but you'll need extra logic to handle negative values.

s = abs(df3['min'] - df3['date']) / np.timedelta64(1,'s') 
print(s)
0        0.0
1        0.0
2        0.0
3        1.0
4        1.0
5     2647.0
6     3295.0
7        2.0
8        2.0
9        3.0
10       3.0
11       3.0
12       4.0
13       4.0
14       4.0
15       6.0
dtype: float64
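
The same elapsed-seconds calculation can also be written with `.dt.total_seconds()`. The sketch below is only an illustration on three hand-picked rows (the variable names are assumptions); it shows both the signed and the absolute difference, so the negative case mentioned above is visible.

```python
import pandas as pd

# Three observation times and the matching entry ("min") times.
dates = pd.Series(pd.to_datetime([
    "2019-09-17 08:43:06", "2019-09-17 08:43:07", "2019-09-17 08:43:07",
]))
entry = pd.Series(pd.to_datetime([
    "2019-09-17 08:43:06", "2019-09-17 08:43:06", "2019-09-17 09:27:14",
]))

# Signed difference: negative when the row is dated before the entry time.
signed = (dates - entry).dt.total_seconds()

# Absolute difference, equivalent to abs(...) / np.timedelta64(1, 's').
absolute = signed.abs()
```

A negative value here means the row is dated before the recorded entry time of its id, which usually points to a mismatch between the sample rows and the min/max table.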

You can do this a number of ways, but I'll just use .loc to set the values in order. Note that the thresholds and group sizes below are hard-coded for this sample.

df3.loc[s <= 3, 'GroupSize'] = 3
df3.loc[(s > 3) & (s <= 7), 'GroupSize'] = 2
df3.loc[s > 7, 'GroupSize'] = 1

print(df3[['id','date','x','y','min','GroupSize']])
          id                date    x    y                 min  GroupSize
0   4900 2019-09-17 08:43:06  206  210 2019-09-17 08:43:06        3.0
1   4909 2019-09-17 08:43:06  234  236 2019-09-17 08:43:06        3.0
2   4911 2019-09-17 08:43:06  251  222 2019-09-17 08:43:06        3.0
3   4909 2019-09-17 08:43:07  231  244 2019-09-17 08:43:06        3.0
4   4911 2019-09-17 08:43:07  252  222 2019-09-17 08:43:06        3.0
5   4965 2019-09-17 08:43:07  207  210 2019-09-17 09:27:14        1.0
6   5163 2019-09-17 08:43:08  234  250 2019-09-17 09:38:03        1.0
7   4911 2019-09-17 08:43:08  252  222 2019-09-17 08:43:06        3.0
8   4900 2019-09-17 08:43:08  206  210 2019-09-17 08:43:06        3.0
9   4911 2019-09-17 08:43:09  252  222 2019-09-17 08:43:06        3.0
10  4900 2019-09-17 08:43:09  206  210 2019-09-17 08:43:06        3.0
11  4909 2019-09-17 08:43:09  223  247 2019-09-17 08:43:06        3.0
12  4900 2019-09-17 08:43:10  206  210 2019-09-17 08:43:06        2.0
13  4909 2019-09-17 08:43:10  229  237 2019-09-17 08:43:06        2.0
14  4911 2019-09-17 08:43:10  252  222 2019-09-17 08:43:06        2.0
15  4909 2019-09-17 08:43:12  226  241 2019-09-17 08:43:06        2.0
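
The thresholds above are fixed by hand. If the goal is to derive the group size from the data, one possible alternative is to bucket each id's entry time to the minute and count how many ids share a bucket. This is a sketch under assumptions, not the method used in this answer: it assumes "group" means "ids whose first appearance falls in the same calendar minute", and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample: ids 1 and 2 enter in the same minute, id 3 enters later.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-09-17 08:43:06", "2019-09-17 08:43:40",
        "2019-09-17 08:44:10", "2019-09-17 08:45:02",
        "2019-09-17 08:45:30",
    ]),
    "id": [1, 2, 1, 3, 2],
})

# Entry time of each id (its first appearance).
entry = df.groupby("id")["date"].min()

# Round each entry time down to the start of its minute.
bucket = entry.dt.floor("min")

# For each id, count how many ids share its entry-minute bucket.
size_per_id = bucket.map(bucket.value_counts())

# Broadcast the per-id group size back onto every row of that id.
df["groupsize"] = df["id"].map(size_per_id)
print(df)
```

Fixed buckets have a boundary effect: ids entering at 08:43:59 and 08:44:01 land in different groups even though they are two seconds apart. If that matters, a gap-based grouping on the sorted entry times may fit better.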

Source: https://stackoverflow.com/questions/59465127/determining-group-size-based-entry-and-exit-times-of-ids-in-my-df

Tags

python

pandas

datetime

group-by