Question
This gives me the entry and exit times of IDs in my df:
minmax = merged_df.groupby('id')['date'].agg(['min', 'max'])
result
id min max
4900 2019-09-17 08:43:06 2019-09-17 09:38:20
4909 2019-09-17 08:43:06 2019-09-17 09:16:00
4911 2019-09-17 08:43:06 2019-09-17 09:43:58
4965 2019-09-17 09:27:14 2019-09-17 09:38:28
5134 2019-09-17 09:34:26 2019-09-17 09:38:27
5139 2019-09-17 09:37:03 2019-09-17 09:46:19
5141 2019-09-17 09:37:22 2019-09-17 12:06:30
5163 2019-09-17 09:38:03 2019-09-17 10:18:29
5170 2019-09-17 09:38:19 2019-09-17 12:47:49
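For anyone who wants to reproduce the table above, here is a minimal sketch. The small `merged_df` built here is a made-up stand-in for the real data (only `date` and `id` are needed), and it assumes `date` has already been parsed as datetimes.

```python
import pandas as pd

# Hypothetical sample standing in for merged_df; the values are illustrative only.
merged_df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-09-17 08:43:06", "2019-09-17 08:43:06",
        "2019-09-17 08:43:07", "2019-09-17 08:43:09",
    ]),
    "id": [4900, 4909, 4909, 4900],
})

# First (entry) and last (exit) timestamp per id.
# String names select the built-in groupby reductions.
minmax = merged_df.groupby("id")["date"].agg(["min", "max"])
print(minmax)
```

The result is indexed by `id`, with one `min` and one `max` column.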
This is what my DataFrame looks like:
df
date x y id
2019-09-17 08:43:06 206 210 4900
2019-09-17 08:43:06 234 236 4909
2019-09-17 08:43:06 251 222 4911
2019-09-17 08:43:07 231 244 4909
2019-09-17 08:43:07 252 222 4911
2019-09-17 08:43:07 207 210 4965
2019-09-17 08:43:08 234 250 5163
2019-09-17 08:43:08 252 222 4911
2019-09-17 08:43:08 206 210 4900
2019-09-17 08:43:09 252 222 4911
2019-09-17 08:43:09 206 210 4900
2019-09-17 08:43:09 223 247 4909
2019-09-17 08:43:10 206 210 4900
2019-09-17 08:43:10 229 237 4909
2019-09-17 08:43:10 252 222 4911
2019-09-17 08:43:12 226 241 4909
How can I create a new column in my DataFrame that compares the entry times of the IDs in a given second and, if several IDs appeared in the same time range (for example, the same minute), records the size of that group? Something like this:
df
date x y id groupsize
2019-09-17 08:43:06 206 210 4900 3
2019-09-17 08:43:06 234 236 4909 3
2019-09-17 08:43:06 251 222 4911 3
2019-09-17 08:43:07 231 244 4909 3
2019-09-17 08:43:07 252 222 4911 3
2019-09-17 08:43:07 207 210 4965 1
2019-09-17 08:43:08 234 250 5134 1
2019-09-17 08:43:08 252 222 5139 2
2019-09-17 08:43:08 206 210 4900 3
2019-09-17 08:43:09 252 222 4911 3
2019-09-17 08:43:09 206 210 4900 3
2019-09-17 08:43:09 223 247 4909 3
2019-09-17 08:43:10 206 210 5141 2
2019-09-17 08:43:10 229 237 4909 3
2019-09-17 08:43:10 252 222 5163 2
2019-09-17 08:43:12 226 241 5170 2
How can I do this? Can anyone help me out with this?
I appreciate any hint!
Answer 1:
IIUC, first let's merge the min and max values onto your DataFrame:
import pandas as pd
import numpy as np
df3 = pd.merge(df,minmax,on='id',how='left')
date x y id min max
0 2019-09-17 08:43:06 206 210 4900 2019-09-17 08:43:06 2019-09-17 09:38:20
1 2019-09-17 08:43:06 234 236 4909 2019-09-17 08:43:06 2019-09-17 09:16:00
2 2019-09-17 08:43:06 251 222 4911 2019-09-17 08:43:06 2019-09-17 09:43:58
3 2019-09-17 08:43:07 231 244 4909 2019-09-17 08:43:06 2019-09-17 09:16:00
4 2019-09-17 08:43:07 252 222 4911 2019-09-17 08:43:06 2019-09-17 09:43:58
5 2019-09-17 08:43:07 207 210 4965 2019-09-17 09:27:14 2019-09-17 09:38:28
6 2019-09-17 08:43:08 234 250 5163 2019-09-17 09:38:03 2019-09-17 10:18:29
7 2019-09-17 08:43:08 252 222 4911 2019-09-17 08:43:06 2019-09-17 09:43:58
8 2019-09-17 08:43:08 206 210 4900 2019-09-17 08:43:06 2019-09-17 09:38:20
9 2019-09-17 08:43:09 252 222 4911 2019-09-17 08:43:06 2019-09-17 09:43:58
10 2019-09-17 08:43:09 206 210 4900 2019-09-17 08:43:06 2019-09-17 09:38:20
11 2019-09-17 08:43:09 223 247 4909 2019-09-17 08:43:06 2019-09-17 09:16:00
12 2019-09-17 08:43:10 206 210 4900 2019-09-17 08:43:06 2019-09-17 09:38:20
13 2019-09-17 08:43:10 229 237 4909 2019-09-17 08:43:06 2019-09-17 09:16:00
14 2019-09-17 08:43:10 252 222 4911 2019-09-17 08:43:06 2019-09-17 09:43:58
15 2019-09-17 08:43:12 226 241 4909 2019-09-17 08:43:06 2019-09-17 09:16:00
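A self-contained sketch of the same left merge, in case it helps to see it on data you can run. The sample frame is invented for illustration, and `reset_index()` and `validate` are optional additions of mine, not part of the call above.

```python
import pandas as pd

# Hypothetical sample data; only 'date' and 'id' matter for the join.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-09-17 08:43:06", "2019-09-17 08:43:07", "2019-09-17 08:43:08",
    ]),
    "id": [4900, 4909, 4900],
})
minmax = df.groupby("id")["date"].agg(["min", "max"])

# reset_index() turns the 'id' index into a column so the join key is explicit.
# validate="many_to_one" raises an error if minmax holds duplicate ids.
df3 = df.merge(minmax.reset_index(), on="id", how="left", validate="many_to_one")
print(df3)
```

A left merge keeps every row of `df` and repeats each id's `min` and `max` on all of that id's rows.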
Then let's work out the absolute difference in seconds between each row's date and its min (entry) value. If you need the signed value, you can drop the abs(), but you'll need extra logic to handle negative values.
s = abs(df3['min'] - df3['date']) / np.timedelta64(1,'s')
print(s)
0 0.0
1 0.0
2 0.0
3 1.0
4 1.0
5 2647.0
6 3295.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 4.0
13 4.0
14 4.0
15 6.0
dtype: float64
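An equivalent way to get the same numbers, sketched with the `.dt.total_seconds()` accessor instead of dividing by `np.timedelta64`. The three rows are hand-picked from the table above, and the name `signed` is just for illustration.

```python
import pandas as pd

# Three rows taken from the merged frame above (date and min columns only).
df3 = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-09-17 08:43:06", "2019-09-17 08:43:09", "2019-09-17 08:43:07",
    ]),
    "min": pd.to_datetime([
        "2019-09-17 08:43:06", "2019-09-17 08:43:06", "2019-09-17 09:27:14",
    ]),
})

# Signed offset: negative when the row's date is earlier than the id's min.
signed = (df3["date"] - df3["min"]).dt.total_seconds()

# Absolute offset, matching the values printed above.
s = signed.abs()
print(s)
```

Keeping `signed` around makes it easy to spot rows whose date falls before the recorded entry time.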
You can do this a number of ways, but I'll just use .loc to set your values in order.
df3.loc[s <= 3, 'GroupSize'] = 3
df3.loc[(s > 3) & (s <= 7), 'GroupSize'] = 2
df3.loc[s > 7, 'GroupSize'] = 1
print(df3[['id','date','x','y','min','GroupSize']])
id date x y min GroupSize
0 4900 2019-09-17 08:43:06 206 210 2019-09-17 08:43:06 3.0
1 4909 2019-09-17 08:43:06 234 236 2019-09-17 08:43:06 3.0
2 4911 2019-09-17 08:43:06 251 222 2019-09-17 08:43:06 3.0
3 4909 2019-09-17 08:43:07 231 244 2019-09-17 08:43:06 3.0
4 4911 2019-09-17 08:43:07 252 222 2019-09-17 08:43:06 3.0
5 4965 2019-09-17 08:43:07 207 210 2019-09-17 09:27:14 1.0
6 5163 2019-09-17 08:43:08 234 250 2019-09-17 09:38:03 1.0
7 4911 2019-09-17 08:43:08 252 222 2019-09-17 08:43:06 3.0
8 4900 2019-09-17 08:43:08 206 210 2019-09-17 08:43:06 3.0
9 4911 2019-09-17 08:43:09 252 222 2019-09-17 08:43:06 3.0
10 4900 2019-09-17 08:43:09 206 210 2019-09-17 08:43:06 3.0
11 4909 2019-09-17 08:43:09 223 247 2019-09-17 08:43:06 3.0
12 4900 2019-09-17 08:43:10 206 210 2019-09-17 08:43:06 2.0
13 4909 2019-09-17 08:43:10 229 237 2019-09-17 08:43:06 2.0
14 4911 2019-09-17 08:43:10 252 222 2019-09-17 08:43:06 2.0
15 4909 2019-09-17 08:43:12 226 241 2019-09-17 08:43:06 2.0
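The thresholds above are hard-coded. If the goal is to derive the group size from the data, one possible alternative (a different technique from the `.loc` approach, and only a sketch) is to count how many distinct IDs first appear within the same minute. The sample data, the `entry_minute` name, and the choice of a one-minute bucket are all assumptions.

```python
import pandas as pd

# Hypothetical sample: three ids enter at 08:43, one id enters at 08:45.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-09-17 08:43:06", "2019-09-17 08:43:06", "2019-09-17 08:43:06",
        "2019-09-17 08:43:07", "2019-09-17 08:45:10", "2019-09-17 08:45:11",
    ]),
    "id": [4900, 4909, 4911, 4909, 5000, 5000],
})

# Entry time of each row's id, broadcast back onto every row.
entry = df.groupby("id")["date"].transform("min")

# Bucket the entry times by minute (assumed definition of "same time range").
entry_minute = entry.dt.floor("min")

# Group size = number of distinct ids sharing the same entry minute.
df["groupsize"] = df.groupby(entry_minute)["id"].transform("nunique")
print(df)
```

Note that fixed buckets split IDs that enter a second apart on either side of a minute boundary. If that matters, a rolling window over the entry times would be needed instead.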
Source: https://stackoverflow.com/questions/59465127/determining-group-size-based-entry-and-exit-times-of-ids-in-my-df