Groupby filter, based on consecutive sequence sorted and ID and Date column

白昼怎懂夜的黑 提交于 2019-12-02 10:42:31

Try this:

import re
import pandas as pd
import numpy as np

df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
print(df)

Output:

    ID Status       Date
0    1      F 2017-06-22
1    1      M 2017-07-22
2    1      P 2017-10-22
3    1      F 2018-06-22
4    1      P 2018-08-22
5    1      F 2018-10-22
6    1      F 2019-03-22
7    2      M 2017-06-29
8    2      F 2017-09-29
9    2      F 2018-01-29
10   2      M 2018-03-29
11   2      P 2018-08-29
12   2      M 2018-10-29
13   2      F 2018-12-29
14   3      M 2017-03-20
15   3      F 2018-06-20
16   3      P 2018-08-20
17   3      M 2018-10-20
18   3      F 2018-11-20
19   3      P 2018-12-20
20   3      F 2019-03-20
22   4      M 2017-08-10
23   4      F 2018-06-10
24   4      P 2018-08-10
25   4      F 2018-12-10
26   4      M 2019-01-10
27   4      F 2019-06-10
31   7      M 2017-08-10
32   7      F 2018-04-10
33   7      P 2018-08-10
34   7      F 2018-11-10
33   7      P 2019-08-10
34   7      F 2019-10-10

My first trick is to use regular expressions to find the location of a substring in the longer string. Using join, I build a string and look for the pattern in that string. Define pattern by putting the longer pattern first.

pattern = "FPFPF|FPF"
def f(x):
    m = re.search(pattern, ''.join(x['Status']))
    return x[m.start():m.end()] if m else None
df1 = df.groupby('ID', group_keys=False).apply(f)
print(df1)

Output:

    ID Status       Date
3    1      F 2018-06-22
4    1      P 2018-08-22
5    1      F 2018-10-22
18   3      F 2018-11-20
19   3      P 2018-12-20
20   3      F 2019-03-20
23   4      F 2018-06-10
24   4      P 2018-08-10
25   4      F 2018-12-10
32   7      F 2018-04-10
33   7      P 2018-08-10
34   7      F 2018-11-10
33   7      P 2019-08-10
34   7      F 2019-10-10

Calculate duration

df1['Duration'] = df1.groupby('ID')['Date'].diff().dt.days
print(df1)

Output:

    ID Status       Date  Duration
3    1      F 2018-06-22       NaN
4    1      P 2018-08-22      61.0
5    1      F 2018-10-22      61.0
18   3      F 2018-11-20       NaN
19   3      P 2018-12-20      30.0
20   3      F 2019-03-20      90.0
23   4      F 2018-06-10       NaN
24   4      P 2018-08-10      61.0
25   4      F 2018-12-10     122.0
32   7      F 2018-04-10       NaN
33   7      P 2018-08-10     122.0
34   7      F 2018-11-10      92.0
33   7      P 2019-08-10     273.0
34   7      F 2019-10-10      61.0

Aggregate during using the latest pandas 0.25 with aggregation relabeling:

df_out = df1.groupby(['ID',(df1['Status'] != 'F').cumsum()])['Duration']\
            .agg(F_P_Duration = lambda x: x.iloc[0], 
                 F_F_Duration = 'sum').dropna()
print(df_out)

Output:

           F_P_Duration  F_F_Duration
ID Status                            
1  1               61.0         122.0
3  2               30.0         120.0
4  3               61.0         183.0
7  4              122.0         214.0
   5              273.0         334.0

Update for prior to Pandas 0.25...

df1.groupby(['ID',(df1['Status'] != 'F').cumsum()])['Duration']\
            .agg(['first', 'sum', 'last']).dropna()\
            .rename(columns={'first':'F_P_Duration',
                             'sum':'F_F_Duration',
                             'last':'P_F_Duration'})

Output:

           F_P_Duration  F_F_Duration  P_F_Duration
ID Status                                          
1  1               61.0         122.0          61.0
3  2               30.0         120.0          90.0
4  3               61.0         183.0         122.0
7  4              122.0         214.0          92.0
   5              273.0         334.0          61.0

Try if this works for you

mask=df['Status']=='M' 
dff=df.loc[~mask, :].copy()
dff['a']= dff.groupby(["ID"])['Status'].shift()
df2=dff.loc[dff['Status'] != dff['a']]
df2.drop('a' , axis=1, inplace=True)

or just this would give the same result as above

mask=df['Status']=='M'
dff=df.loc[~mask, :].copy()
dff= dff.loc[dff.groupby(["ID"])['Status'].shift()!= dff['Status']]
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!