Identify consecutive same values in Pandas Dataframe, with a Groupby

说谎 2020-12-08 05:18

I have the following dataframe df:

import pandas as pd
data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
      'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]}
df = pd.DataFrame(data)


        
4 Answers
  •  春和景丽
    2020-12-08 06:03

    You can try this: 1) create an extra grouping variable with df.value.diff().ne(0).cumsum() that marks where the value changes; 2) use transform('size') to compute the size of each group and compare it with three, which gives the flag column you need:

    df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int) 
    df
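
For reference, here is a self-contained sketch of the one-liner above, run against the data from the question. The intermediate names `run` and `run_size` are chosen here for readability and are not part of the original answer; the threshold of 3 is taken from the `ge(3)` call.

```python
import pandas as pd

data = {
    'id':    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'value': [2, 2, 3, 2, 2, 2, 3, 3, 3, 3, 1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1],
}
df = pd.DataFrame(data)

# Label each run of equal consecutive values (`run` is a name chosen here).
run = df['value'].diff().ne(0).cumsum().rename('run')

# Size of the (id, run) group each row belongs to, broadcast back to the rows.
run_size = df['value'].groupby([df['id'], run]).transform('size')

# Flag rows that sit in a run of at least three equal values.
df['flag'] = run_size.ge(3).astype(int)

print(df['flag'].tolist())
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
```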
    


    Breakdown:

    1) df.value.diff().ne(0) literally means "the difference is not equal to zero", so it is True wherever the value changes. The first row is also True, because its diff is NaN:

    df.value.diff().ne(0)
    #0      True
    #1     False
    #2      True
    #3      True
    #4     False
    #5     False
    #6      True
    #7     False
    #8     False
    #9     False
    #10     True
    #11     True
    #12     True
    #13    False
    #14    False
    #15     True
    #16    False
    #17     True
    #18    False
    #19    False
    #20    False
    #21    False
    #Name: value, dtype: bool
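
One caveat worth noting: diff() relies on subtraction, so it only applies to numeric columns. A possible alternative, not used in the answer above, is to compare each value with its predecessor via shift(). The sketch below assumes the first ten values from the question plus a made-up `labels` series of strings, and checks that both approaches agree on the numeric data.

```python
import pandas as pd

s = pd.Series([2, 2, 3, 2, 2, 2, 3, 3, 3, 3], name='value')

by_diff = s.diff().ne(0)    # numeric only: relies on subtraction
by_shift = s.ne(s.shift())  # compares each value with the previous one

print(by_diff.tolist() == by_shift.tolist())
# True

# The shift-based form also works for strings (hypothetical example data).
labels = pd.Series(['a', 'a', 'b', 'b', 'b', 'a'])
change = labels.ne(labels.shift())
print(change.tolist())
# [True, False, True, False, False, True]
```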
    

    2) cumsum then turns this into a non-decreasing sequence of ids, where each id denotes one consecutive chunk of equal values. When boolean values are summed, True counts as one and False as zero:

    df.value.diff().ne(0).cumsum()
    #0     1
    #1     1
    #2     2
    #3     3
    #4     3
    #5     3
    #6     4
    #7     4
    #8     4
    #9     4
    #10    5
    #11    6
    #12    7
    #13    7
    #14    7
    #15    8
    #16    8
    #17    9
    #18    9
    #19    9
    #20    9
    #21    9
    #Name: value, dtype: int64
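
To see what these run ids encode, the following sketch collapses each run into one row holding the repeated value and the run length. It is only an illustration; the names `run_id` and `runs` are chosen here, and the id column is deliberately ignored at this stage.

```python
import pandas as pd

values = [2, 2, 3, 2, 2, 2, 3, 3, 3, 3, 1, 4, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1]
s = pd.Series(values, name='value')

# Booleans sum as 0/1, so the running total steps up at every value change.
run_id = s.diff().ne(0).cumsum().rename('run')

# One row per run: the value that repeats and how many times it repeats.
runs = s.groupby(run_id).agg(['first', 'size'])

print(runs['first'].tolist())
# [2, 3, 2, 3, 1, 4, 1, 4, 1]
print(runs['size'].tolist())
# [2, 1, 3, 4, 1, 1, 3, 2, 5]
```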
    

    3) Combined with the id column, these run ids let you group the data frame, compute each group's size, and derive the flag column.
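
Why the id column matters: the run ids are computed over the whole column, so a run can continue across an id boundary when the last value of one id equals the first value of the next. The sketch below wraps the approach in a hypothetical helper, `flag_runs` (its name and `min_len` parameter are assumptions made here), and uses a small made-up frame to show the difference.

```python
import pandas as pd

def flag_runs(df, min_len=3):
    """Return 1 for rows in a run of >= min_len equal values within one id."""
    run = df['value'].diff().ne(0).cumsum().rename('run')
    size = df['value'].groupby([df['id'], run]).transform('size')
    return size.ge(min_len).astype(int)

# Made-up edge case: the same value repeats across the id boundary.
edge = pd.DataFrame({'id': [1, 1, 2, 2], 'value': [5, 5, 5, 5]})

# Grouping by the run id alone treats all four rows as one run.
run = edge['value'].diff().ne(0).cumsum().rename('run')
run_only = edge['value'].groupby(run).transform('size').ge(3).astype(int)

print(run_only.tolist())
# [1, 1, 1, 1]
print(flag_runs(edge).tolist())
# [0, 0, 0, 0]
```

Grouping by both keys splits the four-row run into two runs of length two, one per id, so neither reaches the threshold.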
