I have the following dataframe:
obj_id data_date value
0 4 2011-11-01 59500
1 2 2011-10-01 35200
2 4 2010-07-31 24860
This is another possible solution. I believe it's the fastest.
df.loc[df.groupby('obj_id').data_date.idxmax(),:]
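For reference, here is a minimal, self-contained version of that one-liner against the sample data (the dtypes are my assumption; parsing data_date as real datetimes keeps the comparisons safe):

import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'obj_id': [4, 2, 4],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01', '2010-07-31']),
    'value': [59500, 35200, 24860],
})

# idxmax() returns, per obj_id, the index label of the row with the latest date;
# .loc then pulls those complete rows out of the original frame.
print(df.loc[df.groupby('obj_id')['data_date'].idxmax()])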
I believe I've found a more appropriate solution based on the ones in this thread. Mine uses a DataFrame's apply function instead of aggregate, and it returns, for each CARD_NO, the most recent DATE.
import pandas as pd

df = pd.DataFrame({
    'CARD_NO': ['000', '001', '002', '002', '001', '111'],
    'DATE': ['2006-12-31 20:11:39', '2006-12-27 20:11:53', '2006-12-28 20:12:11',
             '2006-12-28 20:12:13', '2008-12-27 20:11:53', '2006-12-30 20:11:39']})
print(df)
df.groupby('CARD_NO').apply(lambda g: g['DATE'].values[g['DATE'].values.argmax()])
Original
CARD_NO DATE
0 000 2006-12-31 20:11:39
1 001 2006-12-27 20:11:53
2 002 2006-12-28 20:12:11
3 002 2006-12-28 20:12:13
4 001 2008-12-27 20:11:53
5 111 2006-12-30 20:11:39
Returned (a Series with the most recent DATE per CARD_NO):
CARD_NO
000 2006-12-31 20:11:39
001 2008-12-27 20:11:53
002 2006-12-28 20:12:13
111 2006-12-30 20:11:39
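If you want the whole rows back rather than just the dates, the same argmax idea can drive an iloc lookup per group. A sketch (converting DATE to real datetimes first is my addition, so string comparison quirks can't bite):

df['DATE'] = pd.to_datetime(df['DATE'])
# argmax finds the positional index of the latest DATE within each group;
# iloc with a list keeps each result as a one-row frame.
full_rows = df.groupby('CARD_NO', group_keys=False).apply(
    lambda g: g.iloc[[g['DATE'].values.argmax()]])
print(full_rows)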
If the number of "obj_id"s is very high, you'll want to sort the entire dataframe and then drop duplicates, keeping the last row for each obj_id.
sorted = df.sort_index(by='data_date')
result = sorted.drop_duplicates('obj_id', keep='last').values
This should be faster (sorry, I didn't test it) because you don't have to use a custom agg function, which is slow when there are a large number of keys. You might think it's worse to sort the entire dataframe, but in practice sorts are fast (they run in native code) while Python-level loops are slow.
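If you want to check that speed claim yourself, here is a rough benchmark sketch (the sizes, column names, and timeit harness are my assumptions; numbers will vary with your data):

import numpy as np
import pandas as pd
from timeit import timeit

# Many rows spread over many obj_id groups, to stress the per-group work.
n = 100_000
big = pd.DataFrame({
    'obj_id': np.random.randint(0, 10_000, n),
    'data_date': pd.Timestamp('2010-01-01')
                 + pd.to_timedelta(np.random.randint(0, 1000, n), unit='D'),
    'value': np.random.randint(0, 100_000, n),
})

t_sort = timeit(lambda: big.sort_values('data_date')
                           .drop_duplicates('obj_id', keep='last'), number=3)
t_apply = timeit(lambda: big.groupby('obj_id')
                            .apply(lambda g: g.loc[g['data_date'].idxmax()]), number=3)
print(f'sort+drop_duplicates: {t_sort:.3f}s, per-group apply: {t_apply:.3f}s')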
I like crewbum's answer; this is probably faster (sorry, I haven't tested it yet, but it avoids sorting everything):
df.groupby('obj_id').agg(lambda g: g.values[g['data_date'].values.argmax()])
It uses NumPy's argmax function to find the index of the row in which the maximum appears.
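Note that in current pandas, agg applies the function to each column separately, so a lambda that expects the whole group frame no longer works there; apply is the closer modern equivalent (a sketch under that assumption):

# Same argmax trick via apply, which does receive each group as a DataFrame.
result = df.groupby('obj_id').apply(lambda g: g.iloc[g['data_date'].values.argmax()])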
Updating thetainted1's answer, since some of those functions now raise future warnings, as tommy.carstensen pointed out. Here's what worked for me:
sorted_df = df.sort_values(by='data_date')
result = sorted_df.drop_duplicates('obj_id', keep='last')
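An equivalent one-liner, if you prefer it (a sketch; sorting newest-first lets drop_duplicates keep its default first occurrence per obj_id):

result = df.sort_values('data_date', ascending=False).drop_duplicates('obj_id')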
The aggregate() method can build a new DataFrame from a groupby object in a single step. (I'm not aware of a cleaner way to extract the first/last row of a DataFrame, though.)
In [12]: df.groupby('obj_id').agg(lambda g: g.sort_values('data_date')[-1:].values[0])
Out[12]:
data_date value
obj_id
1 2009-07-28 15860
2 2011-10-01 35200
4 2011-11-01 59500
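Since .sort() was later removed and agg became column-wise, a present-day way to get the same table (a sketch, not part of the original answer) is to sort and take each group's last row:

result = df.sort_values('data_date').groupby('obj_id').last()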
You can also perform aggregation on individual columns, in which case the aggregate function works on a Series object.
In [25]: df.groupby('obj_id')['value'].agg(diff=lambda s: s.max() - s.min())
Out[25]:
diff
obj_id
1 0
2 165000
4 34640
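The keyword form used in In [25] is pandas' named aggregation (0.25+); it scales to several statistics at once (a sketch; the extra aggregate names are mine):

stats = df.groupby('obj_id')['value'].agg(
    diff=lambda s: s.max() - s.min(),
    lo='min',
    hi='max',
)
print(stats)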