I want to do a quick and easy check if all column values for counts are the same in a dataframe:

In:
import pandas as pd

d = {'names': ['Jim', 'Ted', 'Mal', 'Ted'], 'counts': [3, 4, 5, 4]}
df = pd.DataFrame(data=d)
An efficient way to do this is by comparing the first value with the rest, and using all:

def is_unique(s):
    a = s.to_numpy()  # s.values (pandas<0.24)
    return (a[0] == a).all()

is_unique(df['counts'])
# False
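As a quick sanity check with made-up data, a constant series returns True:

s_const = pd.Series([4, 4, 4, 4])  # hypothetical constant series
is_unique(s_const)
# True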
To perform the same check on an entire dataframe, we can extend the above by setting axis=0 in all:

def unique_cols(df):
    a = df.to_numpy()  # df.values (pandas<0.24)
    return (a[0] == a).all(0)
For the shared example, we'd get:
unique_cols(df)
# array([False, False])
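If you want the labels of the constant columns rather than a boolean mask, you can index df.columns with the result; for the shared example (which has no constant columns) this yields an empty Index:

df.columns[unique_cols(df)]
# Index([], dtype='object')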
Here's a benchmark of the above methods compared with some other approaches, such as using nunique (for a pd.Series):
import numpy as np
import perfplot

s_num = pd.Series(np.random.randint(0, 1_000, 1_100_000))

perfplot.show(
    setup=lambda n: s_num.iloc[:int(n)],
    kernels=[
        lambda s: s.nunique() == 1,
        lambda s: is_unique(s)
    ],
    labels=['nunique', 'first_vs_rest'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)
And below are the timings for a pd.DataFrame. Let's also compare with a numba approach, which is especially useful here since we can take advantage of short-circuiting as soon as we see a value that differs from the first in a given column (note: the numba approach will only work with numerical data):
from numba import njit

@njit
def unique_cols_nb(a):
    n_cols = a.shape[1]
    out = np.zeros(n_cols, dtype=np.int32)
    for i in range(n_cols):
        init = a[0, i]
        # stop scanning this column as soon as a value differs from the first
        for j in a[1:, i]:
            if j != init:
                break
        else:
            # the else clause runs only if the loop finished without a break,
            # i.e. the column is constant
            out[i] = 1
    return out
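A minimal usage sketch on a hypothetical all-numeric frame (the numba version needs a plain NumPy array of numbers, hence .to_numpy()):

num_df = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 2, 3]})
unique_cols_nb(num_df.to_numpy()).astype(bool)
# array([ True, False])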
If we compare the three methods:
df = pd.DataFrame(np.concatenate([np.random.randint(0, 1_000, (500_000, 200)),
                                  np.zeros((500_000, 10))], axis=1))

perfplot.show(
    setup=lambda n: df.iloc[:int(n), :],
    kernels=[
        lambda df: (df.nunique(0) == 1).values,
        lambda df: unique_cols_nb(df.values).astype(bool),
        lambda df: unique_cols(df)
    ],
    labels=['nunique', 'unique_cols_nb', 'unique_cols'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)
I think nunique does much more work than necessary. Iteration can stop at the first difference. This simple and generic solution uses itertools:
import itertools

def all_equal(iterable):
    "Returns True if all elements are equal to each other"
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)
all_equal(df.counts)
One can use this even to find all columns with constant contents in one go:
constant_columns = df.columns[df.apply(all_equal)]
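The trick is that groupby merges consecutive equal values into groups, so an all-equal iterable yields exactly one group; the second next() call short-circuits as soon as a second group, i.e. a first difference, appears. A small sketch of how this behaves, assuming a hypothetical frame with one constant column:

demo = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 2, 3]})
demo.columns[demo.apply(all_equal)]
# Index(['a'], dtype='object')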
A slightly more readable but less performant alternative:

df.counts.min() == df.counts.max()

Add skipna=False here if necessary.
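To see why skipna matters, here is a small sketch with made-up data: with the default skipna=True, NaNs are dropped before the comparison, so a column that is constant apart from missing values still compares equal; with skipna=False, min and max become NaN and the check returns False.

import numpy as np

s = pd.Series([4.0, np.nan, 4.0])
s.min() == s.max()                          # True: NaNs are skipped by default
s.min(skipna=False) == s.max(skipna=False)  # False: NaN != NaN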
Update using np.unique:

len(np.unique(df.counts)) == 1
# False

Or

len(set(df.counts.tolist())) == 1

Or

df.counts.eq(df.counts.iloc[0]).all()
# False

Or

df.counts.std() == 0
# False