Python: Pandas - Delete the first row by group

不知归路 2020-12-10 12:27

I have the following large dataframe (df) that looks like this:

    ID     date        PRICE       
1   10001  19920103  14.500    
2   10001  1         


        
3 Answers
  • 2020-12-10 13:11

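    Another one-liner is df.groupby('ID').apply(lambda group: group.iloc[1:, 1:]). The first slice drops the first row of each group; the second slice also drops the first column (ID), which then appears only in the index: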

    Out[100]: 
                 date  PRICE
    ID                      
    10001 2  19920106   14.5
          3  19920107   14.5
    10002 5  19920109   14.5
          6  19920110   14.5
    10003 8  19920114   14.5
          9  19920115   15.0
    
  • 2020-12-10 13:21

    You could use groupby/transform to prepare a boolean mask which is True for the rows you want and False for the rows you don't want. Once you have such a boolean mask, you can select the sub-DataFrame using df.loc[mask]:

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(
        {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],
         'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],
         'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
                  19920113, 19920114, 19920115]},
        index=range(1, 10))
    
    def mask_first(x):
        result = np.ones_like(x)
        result[0] = 0
        return result
    
    mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
    print(df.loc[mask])
    

    yields

          ID  PRICE      date
    2  10001   14.5  19920106
    3  10001   14.5  19920107
    5  10002   14.5  19920109
    6  10002   14.5  19920110
    8  10003   14.5  19920114
    9  10003   15.0  19920115
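
The same kind of mask can also be built without a custom function. The following is a minimal sketch, not part of the original answer, that assumes "first row" means the first row of each ID in order of appearance. It uses groupby(...).cumcount(), which numbers the rows within each group starting at 0:

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],
     'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115]},
    index=range(1, 10))

# cumcount() numbers the rows of each group 0, 1, 2, ... in order of
# appearance, so "position > 0" is True for every row except the first
# row of its group.
mask = df.groupby('ID').cumcount() > 0
result = df.loc[mask]
print(result)
```

Because the mask is a boolean Series aligned with df's index, it can be passed to df.loc just like the transform-based mask. No timing is claimed for it here; it would need to be measured alongside the other functions.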
    

    Since you're interested in efficiency, here is a benchmark:

    import timeit
    import operator
    import numpy as np
    import pandas as pd
    
    N = 10000
    df = pd.DataFrame(
        {'ID': np.random.randint(100, size=(N,)),
         'PRICE': np.random.random(N),
         'date': np.random.random(N)}) 
    
    def using_mask(df):
        def mask_first(x):
            result = np.ones_like(x)
            result[0] = 0
            return result
    
        mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
        return df.loc[mask]
    
    def using_apply(df):
        return df.groupby('ID').apply(lambda group: group.iloc[1:, 1:])
    
    def using_apply_alt(df):
        return df.groupby('ID', group_keys=False).apply(lambda x: x[1:])
    
    timing = dict()
    for func in (using_mask, using_apply, using_apply_alt):
        timing[func] = timeit.timeit(
            '{}(df)'.format(func.__name__), 
            'from __main__ import df, {}'.format(func.__name__), number=100)
    
    for func, t in sorted(timing.items(), key=operator.itemgetter(1)):
        print('{:16}: {:.2f}'.format(func.__name__, t))
    

    reports

    using_mask      : 0.85
    using_apply_alt : 2.04
    using_apply     : 3.70
    
  • 2020-12-10 13:30

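    This question is old but still viewed quite often, so: a much faster solution is nth(0) combined with drop_duplicates. Note that drop_duplicates(keep=False) also removes any rows of df that are exact duplicates of one another, so this is only safe when all rows are distinct: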

    def using_nth(df):
        to_del = df.groupby('ID',as_index=False).nth(0)
        return pd.concat([df,to_del]).drop_duplicates(keep=False)
    

    On my system, the timings for unutbu's benchmark setup are:

    using_nth       : 0.43
    using_apply_alt : 1.93
    using_mask      : 2.11
    using_apply     : 4.33
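
The concat/drop_duplicates trick compares column values, not row labels. The sketch below is an illustration rather than part of the original answer: it uses made-up data containing repeated rows, takes the first rows with groupby(...).head(1) instead of nth(0), and compares the trick with two alternatives, dropping by index label and filtering with DataFrame.duplicated. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data with repeated rows (not taken from the question).
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'PRICE': [14.5, 14.5, 14.5, 15.0, 15.0, 16.0]})

# First row of each group, keeping the original index labels.
first_rows = df.groupby('ID').head(1)

# concat + drop_duplicates(keep=False) compares column values only, so every
# row that merely equals a group's first row is dropped along with it.
via_concat = pd.concat([df, first_rows]).drop_duplicates(keep=False)

# Dropping by index label removes exactly one row per group
# (this assumes the index labels are unique).
via_drop = df.drop(first_rows.index)

# duplicated('ID') is False for the first occurrence of each ID, True after.
via_duplicated = df[df.duplicated('ID')]

print(len(via_concat), len(via_drop), len(via_duplicated))
```

On data like the benchmark's, where random floats make repeated rows very unlikely, all three should agree. The two alternatives are not timed here, so whether they match the speed of using_nth is an open question.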
    