Check if all values in dataframe column are the same

Asked by 时光取名叫无心 on 2020-12-09 15:48 · 3 answers · 704 views

I want to do a quick and easy check if all column values for counts are the same in a dataframe:

In:

import pandas as pd

d = {'names': ['Jim', 'Ted', 'Mal', 'Ted'],
     'counts': [3, 4, 5, 4]}
df = pd.DataFrame(data=d)

3 Answers
  • 2020-12-09 16:28

    An efficient way to do this is by comparing the first value with the rest, and using all:

    def is_unique(s):
        a = s.to_numpy() # s.values (pandas<0.24)
        return (a[0] == a).all()
    
    is_unique(df['counts'])
    # False
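    As a quick sanity check on made-up Series (not the question's data), the first-vs-rest comparison distinguishes constant from mixed input:

```python
import pandas as pd

def is_unique(s):
    a = s.to_numpy()  # s.values (pandas<0.24)
    return (a[0] == a).all()

print(is_unique(pd.Series([4, 4, 4])))  # True: every value equals the first
print(is_unique(pd.Series([3, 4, 5])))  # False: 4 and 5 differ from 3
```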
    

    For an entire dataframe

    In the case of wanting to perform the same task on an entire dataframe, we can extend the above by setting axis=0 in all:

    def unique_cols(df):
        a = df.to_numpy() # df.values (pandas<0.24)
        return (a[0] == a).all(0)
    

    For the shared example, we'd get:

    unique_cols(df)
    # array([False, False])
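    For instance, on a small made-up frame where only the first column is constant:

```python
import numpy as np
import pandas as pd

def unique_cols(df):
    a = df.to_numpy()  # df.values (pandas<0.24)
    return (a[0] == a).all(0)

# illustrative frame: 'a' is constant, 'b' is not
demo = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 2, 3]})
print(unique_cols(demo))  # [ True False]
```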
    

    Here's a benchmark of the above methods compared with some other approaches, such as using nunique (for a pd.Series):

    import numpy as np
    import perfplot

    s_num = pd.Series(np.random.randint(0, 1_000, 1_100_000))
    
    perfplot.show(
        setup=lambda n: s_num.iloc[:int(n)], 
    
        kernels=[
            lambda s: s.nunique() == 1,
            lambda s: is_unique(s)
        ],
    
        labels=['nunique', 'first_vs_rest'],
        n_range=[2**k for k in range(0, 20)],
        xlabel='N'
    )
    


    Below are the timings for a pd.DataFrame. Let's also compare with a numba approach, which is especially useful here since it can short-circuit as soon as it sees a value that differs from the first in a given column (note: the numba approach will only work with numerical data):

    import numpy as np
    from numba import njit
    
    @njit
    def unique_cols_nb(a):
        n_cols = a.shape[1]
        out = np.zeros(n_cols, dtype=np.int32)
        for i in range(n_cols):
            init = a[0, i]
            # stop scanning this column at the first differing value
            for j in a[1:, i]:
                if j != init:
                    break
            else:
                # no break: every value equals the first
                out[i] = 1
        return out
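    The for/else here is plain Python: the else branch runs only when the inner loop finishes without hitting break. A dependency-free sketch of the same scan (illustrative only, without the numba speed-up):

```python
import numpy as np

def unique_cols_py(a):
    # same logic as unique_cols_nb, minus the @njit compilation
    out = np.zeros(a.shape[1], dtype=np.int32)
    for i in range(a.shape[1]):
        init = a[0, i]
        for j in a[1:, i]:
            if j != init:
                break        # early exit at the first differing value
        else:
            out[i] = 1       # loop completed: column is constant
    return out

a = np.array([[1, 7], [1, 8], [1, 9]])
print(unique_cols_py(a))  # [1 0]
```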
    

    If we compare the three methods:

    df = pd.DataFrame(np.concatenate([np.random.randint(0, 1_000, (500_000, 200)), 
                                      np.zeros((500_000, 10))], axis=1))
    
    perfplot.show(
        setup=lambda n: df.iloc[:int(n),:], 
    
        kernels=[
            lambda df: (df.nunique(0) == 1).values,
            lambda df: unique_cols_nb(df.values).astype(bool),
            lambda df: unique_cols(df) 
        ],
    
        labels=['nunique', 'unique_cols_nb', 'unique_cols'],
        n_range=[2**k for k in range(0, 20)],
        xlabel='N'
    )
    

  • 2020-12-09 16:29

    I think nunique does much more work than necessary. Iteration can stop at the first difference. This simple and generic solution uses itertools:

    import itertools
    
    def all_equal(iterable):
        "Returns True if all elements are equal to each other"
        g = itertools.groupby(iterable)
        return next(g, True) and not next(g, False)
    
    all_equal(df.counts)
    

    One can use this even to find all columns with constant contents in one go:

    constant_columns = df.columns[df.apply(all_equal)]
    

    A slightly more readable but less performant alternative:

    df.counts.min() == df.counts.max()
    

    Add skipna=False here if necessary.
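    Putting this answer's pieces together on a small made-up frame (column names are illustrative):

```python
import itertools
import pandas as pd

def all_equal(iterable):
    "Returns True if all elements are equal to each other"
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)

demo = pd.DataFrame({'counts': [3, 4, 5, 4], 'flag': [1, 1, 1, 1]})

print(all_equal(demo.counts))                     # False: counts vary
print(list(demo.columns[demo.apply(all_equal)]))  # ['flag']
print(demo.counts.min() == demo.counts.max())     # False
```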

  • 2020-12-09 16:51

    Update using np.unique

    import numpy as np

    len(np.unique(df.counts)) == 1
    # False
    

    Or

    len(set(df.counts.tolist()))==1
    

    Or

    df.counts.eq(df.counts.iloc[0]).all()
    # False
    

    Or

    df.counts.std() == 0
    # False
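    On a made-up non-constant column, all four variants agree; note the std()-based check only applies to numeric data and compares floating-point values:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'counts': [3, 4, 5, 4]})

checks = [
    len(np.unique(demo.counts)) == 1,
    len(set(demo.counts.tolist())) == 1,
    bool(demo.counts.eq(demo.counts.iloc[0]).all()),
    demo.counts.std() == 0,
]
print(checks)  # every check returns False for this column
```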
    