Identify consecutive same values in Pandas Dataframe, with a Groupby

后端未结

关注

 4  1677

I have the following dataframe df:

data={\'id\':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
      \'value\':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]}


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  耶瑟儿～        
                
              
                            
                2020-12-08 06:01
              
            
            
                                                                       
See EDIT2 for a more robust solution

Same result, but a little bit faster:

labels = (df.value != df.value.shift()).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

    id  value  flag
0    1      2     0
1    1      2     0
2    1      3     0
3    1      2     1
4    1      2     1
5    1      2     1
6    1      3     1
7    1      3     1
8    1      3     1
9    1      3     1
10   2      1     0
11   2      4     0
12   2      1     1
13   2      1     1
14   2      1     1
15   2      4     0
16   2      4     0
17   2      1     1
18   2      1     1
19   2      1     1
20   2      1     1
21   2      1     1


Where:


df.value != df.value.shift() gives the value change
cumsum() creates "labels" for each group of same value
labels.value_counts() counts the occurrences of each label
labels.map(...) replaces labels by the counts computed above
>= 3 creates a boolean mask on count value
astype(int) casts the booleans to int


In my hands it give 1.03ms on your df, compared to 2.1ms for Psidoms' approach.
But mine is not one-liner.



EDIT:

A mix between both approaches is even faster

labels = df.value.diff().ne(0).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)


Gives 911µs with your sample df.



EDIT2: correct solution to account for id change, as pointed by @clg4

labels = (df.value.diff().ne(0) | df.id.diff().ne(0)).cumsum()
df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)


Where ... | df.id.diff().ne(0) increment the label where the id changes

This works even with same value on id change (tested with value 3 on index 10) and takes 1.28ms

EDIT3: Better explanations

Take the case where index 10 has value 3. df.id.diff().ne(0)

data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
      'value':[2,2,3,2,2,2,3,3,3,3,3,4,1,1,1,4,4,1,1,1,1,1]}
df=pd.DataFrame.from_dict(data)

df['id_diff'] = df.id.diff().ne(0).astype(int)
df['val_diff'] = df.value.diff().ne(0).astype(int)
df['diff_or'] = (df.id.diff().ne(0) | df.value.diff().ne(0)).astype(int)
df['labels'] = df['diff_or'].cumsum()

     id  value  id_diff  val_diff  diff_or  labels
 0    1      2        1         1        1       1
 1    1      2        0         0        0       1
 2    1      3        0         1        1       2
 3    1      2        0         1        1       3
 4    1      2        0         0        0       3
 5    1      2        0         0        0       3
 6    1      3        0         1        1       4
 7    1      3        0         0        0       4
 8    1      3        0         0        0       4
 9    1      3        0         0        0       4
>10   2      3        1    |    0    =   1       5 <== label increment
 11   2      4        0         1        1       6
 12   2      1        0         1        1       7
 13   2      1        0         0        0       7
 14   2      1        0         0        0       7
 15   2      4        0         1        1       8
 16   2      4        0         0        0       8
 17   2      1        0         1        1       9
 18   2      1        0         0        0       9
 19   2      1        0         0        0       9
 20   2      1        0         0        0       9
 21   2      1        0         0        0       9


The | is operator "bitwise-or", which gives True as long as one of the elements is True. So if there is no diff in value where the id changes, the | reflects the id change. Otherwise it changes nothing.
When .cumsum() is performed, the label is incremented where the id changes, so the value 3 at index 10 is not grouped with values 3 from indexes 6-9.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  春和景丽        
                
              
                            
                2020-12-08 06:03
              
            
            
                                                                       
You can try this; 1) Create an extra group variable with df.value.diff().ne(0).cumsum() to denote the value changes; 2) use transform('size') to calculate the group size and compare with three, then you get the flag column you need:

df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int) 
df






Break downs:

1) diff is not equal to zero (which is literally what df.value.diff().ne(0) means) gives a condition True whenever there is a value change:

df.value.diff().ne(0)
#0      True
#1     False
#2      True
#3      True
#4     False
#5     False
#6      True
#7     False
#8     False
#9     False
#10     True
#11     True
#12     True
#13    False
#14    False
#15     True
#16    False
#17     True
#18    False
#19    False
#20    False
#21    False
#Name: value, dtype: bool


2) Then cumsum gives a non descending sequence of ids where each id denotes a consecutive chunk with same values, note when summing boolean values, True is considered as one while False is considered as zero:

df.value.diff().ne(0).cumsum()
#0     1
#1     1
#2     2
#3     3
#4     3
#5     3
#6     4
#7     4
#8     4
#9     4
#10    5
#11    6
#12    7
#13    7
#14    7
#15    8
#16    8
#17    9
#18    9
#19    9
#20    9
#21    9
#Name: value, dtype: int64


3) combined with id column, you can group the data frame, calculate the group size and get the flag column.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  星月不相逢        
                
              
                            
                2020-12-08 06:08
              
            
            
                                                                       
#try this simpler version
a= pd.Series([1,1,1,2,3,4,5,5,5,7,8,0,0,0])
b= a.groupby([a.ne(0), a]).transform('size').ge(3).astype('int')
#ge(x) <- x is the number of consecutive repeated values 
print b

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  庸人自扰        
                
              
                            
                2020-12-08 06:13
              
            
            
                                                                       
df=pd.DataFrame.from_dict(
        {'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
         'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]})

df2 = df.groupby((df['value'].shift() != df['value']).\
                cumsum()).filter(lambda x: len(x) >= 3)

df['flag'] = np.where(df.index.isin(df2.index),1,0)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复