Pandas: remove group from the data when a value in the group meets a required condition

前端 未结 3 1811
盖世英雄少女心
盖世英雄少女心 2021-01-04 03:44

I have groupings of values in the data and within each group, I would like to check if a value within the group is below 8. If this condition is met, the entire

3条回答
  •  耶瑟儿~
    2021-01-04 04:10

    Create a Boolean Series with your condition then groupby + transform('any') to form a mask for the original DataFrame. This allows you to simply slice the original DataFrame.

    df[~df.Count.lt(8).groupby(df.Groups).transform('any')]
    #   Groups  Count
    #3       2     12
    #4       2     15
    #5       2     21
    

    While the syntax of groupby + filter is more straightforward, it performs much worse for a large number of groups, so creating the Boolean mask with transform is preferred. In this example there's over a 1000x improvement. The .isin method works extremely fast for a single column but would require switching to a merge if grouping on multiple columns.

    import pandas as pd
    import numpy as np
    
    np.random.seed(123)
    N = 50000
    df = pd.DataFrame({'Groups': [*range(N//2)]*2,
                       'Count': np.random.randint(0, 1000, N)})
    
    # Double check both are equivalent
    (df.groupby('Groups').filter(lambda x: x['Count'].min() >= 8)
      == df[~df.Count.lt(8).groupby(df.Groups).transform('any')]).all().all()
    #True
    
    %timeit df.groupby('Groups').filter(lambda x: x['Count'].min() >= 8)
    #8.15 s ± 80.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit df[~df.Count.lt(8).groupby(df.Groups).transform('any')]
    #6.54 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit df[~df['Groups'].isin(df.loc[df['Count'] < 8, 'Groups'].unique())]
    #2.88 ms ± 24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

提交回复
热议问题