Python: Pandas - Delete the first row by group

I have the following large dataframe (df) that looks like this:

    ID     date        PRICE       
1   10001  19920103  14.500    
2   10001  1


                      
3 Answers

Answered by 青春惊慌失措 on 2020-12-10 13:11

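Another one-liner is df.groupby('ID').apply(lambda group: group.iloc[1:, 1:]). Note that iloc[1:, 1:] drops the first column (ID) along with the first row of each group, so ID appears only as the outer level of the resulting index: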

Out[100]: 
             date  PRICE
ID                      
10001 2  19920106   14.5
      3  19920107   14.5
10002 5  19920109   14.5
      6  19920110   14.5
10003 8  19920114   14.5
      9  19920115   15.0

                                                                        
                                                        
            
            
              
                
Answered by 醉话见心 on 2020-12-10 13:21

You could use groupby/transform to prepare a boolean mask which is True for the rows you want and False for the rows you don't want. Once you have such a boolean mask, you can select the sub-DataFrame using df.loc[mask]:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],
     'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115]},
    index=range(1, 10))

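def mask_first(x):
    # Start with an array of ones as long as the group, then zero out the
    # first position so that the first row of each group is masked off.
    result = np.ones_like(x)
    result[0] = 0
    return result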

mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
print(df.loc[mask])


yields

      ID  PRICE      date
2  10001   14.5  19920106
3  10001   14.5  19920107
5  10002   14.5  19920109
6  10002   14.5  19920110
8  10003   14.5  19920114
9  10003   15.0  19920115
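
As a possible alternative to the helper function, the same mask can probably be built with GroupBy.cumcount(), which numbers the rows of each group starting at 0. This is only a sketch on the example data above, not part of the original answer; the variable names position_in_group and result are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],
     'PRICE': [14.5, 14.5, 14.5, 15.125, 14.5, 14.5, 14.5, 14.5, 15.0],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115]},
    index=range(1, 10))

# cumcount() numbers the rows within each group 0, 1, 2, ... in frame order,
# so the first row of every group gets position 0.
position_in_group = df.groupby('ID').cumcount()

# Keep every row that is not the first of its group.
mask = position_in_group > 0
result = df.loc[mask]
print(result)
```

Because the mask is aligned on the original index, the selected rows keep their labels (2, 3, 5, 6, 8, 9 here), just as with the transform-based mask.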




Since you're interested in efficiency, here is a benchmark:

import timeit
import operator
import numpy as np
import pandas as pd

N = 10000
df = pd.DataFrame(
    {'ID': np.random.randint(100, size=(N,)),
     'PRICE': np.random.random(N),
     'date': np.random.random(N)}) 

def using_mask(df):
    def mask_first(x):
        result = np.ones_like(x)
        result[0] = 0
        return result

    mask = df.groupby(['ID'])['ID'].transform(mask_first).astype(bool)
    return df.loc[mask]

def using_apply(df):
    return df.groupby('ID').apply(lambda group: group.iloc[1:, 1:])

def using_apply_alt(df):
    return df.groupby('ID', group_keys=False).apply(lambda x: x[1:])

timing = dict()
for func in (using_mask, using_apply, using_apply_alt):
    timing[func] = timeit.timeit(
        '{}(df)'.format(func.__name__), 
        'from __main__ import df, {}'.format(func.__name__), number=100)

for func, t in sorted(timing.items(), key=operator.itemgetter(1)):
    print('{:16}: {:.2f}'.format(func.__name__, t))


reports

using_mask      : 0.85
using_apply_alt : 2.04
using_apply     : 3.70

                                                                        
                                                        
            
            
              
                
Answered by 既然无缘 on 2020-12-10 13:30

This question is old but still viewed quite often. A much faster solution is nth(0) combined with drop_duplicates:

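def using_nth(df):
    # Collect the first row of every group.
    to_del = df.groupby('ID', as_index=False).nth(0)
    # Append those rows to df so that each first row occurs twice, then
    # drop every row that occurs more than once.
    return pd.concat([df, to_del]).drop_duplicates(keep=False)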


On my system, the timings for unutbu's benchmark setup are:

using_nth       : 0.43
using_apply_alt : 1.93
using_mask      : 2.11
using_apply     : 4.33
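
One caveat worth checking before relying on this approach: drop_duplicates(keep=False) compares row values, so it also removes rows that were already duplicated in the original frame. The sketch below uses made-up data to show the effect, and compares it with a variant based on DataFrame.duplicated; using_duplicated is a hypothetical helper added here for illustration and was not part of the benchmark above.

```python
import pandas as pd

# Made-up data: the second and third rows of group 1 are identical.
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'PRICE': [10.0, 11.0, 11.0, 20.0, 21.0]})

def using_nth(df):
    # Approach from the answer: append each group's first row, then
    # drop every row whose values occur more than once.
    to_del = df.groupby('ID', as_index=False).nth(0)
    return pd.concat([df, to_del]).drop_duplicates(keep=False)

def using_duplicated(df):
    # Hypothetical alternative: duplicated() flags every occurrence of an ID
    # except the first one, which is exactly the set of rows to keep.
    return df[df.duplicated(subset='ID', keep='first')]

nth_result = using_nth(df)
dup_result = using_duplicated(df)

print(nth_result)  # the two identical (1, 11.0) rows are gone as well
print(dup_result)  # only the first row of each group is removed
```

If the real data can never contain fully identical rows, both functions should select the same rows; otherwise the duplicated-based variant (or a boolean mask) seems the safer choice.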

                                                                        
                                                        
            
            
              
                