pandas - get the most recent value of a particular column, grouped by another column

Asked 2020-12-01 11:05

I have the following dataframe:

   obj_id   data_date   value
0  4        2011-11-01  59500    
1  2        2011-10-01  35200 
2  4        2010-07-31  24860          


        
6 Answers
  • 2020-12-01 11:16

    This is another possible solution; I believe it is the fastest.

    df.loc[df.groupby('obj_id').data_date.idxmax(),:]
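
    Applied to the question's frame, this can be sketched as follows (the data is reconstructed from the post; `pd.to_datetime` is used so `idxmax` compares real timestamps rather than strings):

    ```python
    import pandas as pd

    # The question's sample frame, reconstructed from the post.
    df = pd.DataFrame({
        'obj_id': [4, 2, 4],
        'data_date': pd.to_datetime(['2011-11-01', '2011-10-01', '2010-07-31']),
        'value': [59500, 35200, 24860],
    })

    # idxmax returns, per obj_id, the index label of the row with the latest
    # date; .loc then pulls those whole rows back out of the frame.
    latest = df.loc[df.groupby('obj_id')['data_date'].idxmax()]
    ```

    The result keeps one full row per `obj_id`: the 2011-10-01 row for id 2 and the 2011-11-01 row for id 4.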
    
  • 2020-12-01 11:25

    I believe I have found a more appropriate solution based on the ones in this thread. However, mine uses the DataFrame's apply function instead of aggregate. It also returns a new DataFrame with the same columns as the original.

    df = pd.DataFrame({
    'CARD_NO': ['000', '001', '002', '002', '001', '111'],
    'DATE': ['2006-12-31 20:11:39','2006-12-27 20:11:53','2006-12-28 20:12:11','2006-12-28 20:12:13','2008-12-27 20:11:53','2006-12-30 20:11:39']})
    
    print(df)
    df.groupby('CARD_NO').apply(lambda df: df['DATE'].values[df['DATE'].values.argmax()])
    

    Original

    CARD_NO                 DATE
    0     000  2006-12-31 20:11:39
    1     001  2006-12-27 20:11:53
    2     002  2006-12-28 20:12:11
    3     002  2006-12-28 20:12:13
    4     001  2008-12-27 20:11:53
    5     111  2006-12-30 20:11:39
    

    Returned dataframe:

    CARD_NO
    000        2006-12-31 20:11:39
    001        2008-12-27 20:11:53
    002        2006-12-28 20:12:13
    111        2006-12-30 20:11:39
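
    If you want the full rows back rather than just the dates, one variant of the same idea (a sketch, not from the original answer) combines a per-group `idxmax` with `.loc`:

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'CARD_NO': ['000', '001', '002', '002', '001', '111'],
        'DATE': pd.to_datetime([
            '2006-12-31 20:11:39', '2006-12-27 20:11:53', '2006-12-28 20:12:11',
            '2006-12-28 20:12:13', '2008-12-27 20:11:53', '2006-12-30 20:11:39',
        ]),
    })

    # One row per CARD_NO: the row whose DATE is the group's maximum.
    rows = df.loc[df.groupby('CARD_NO')['DATE'].idxmax()]
    ```

    This returns an ordinary two-column DataFrame instead of a Series of values, which is often easier to work with downstream.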
    
  • 2020-12-01 11:33

    If the number of "obj_id"s is very high you'll want to sort the entire dataframe and then drop duplicates to get the last element.

    sorted = df.sort_index(by='data_date')
    result = sorted.drop_duplicates('obj_id', keep='last')
    

    This should be faster (sorry, I didn't test it) because you don't have to write a custom agg function, which is slow when there is a large number of keys. You might think it's worse to sort the entire dataframe, but in practice Python's sorts are fast and its native loops are slow.

  • 2020-12-01 11:34

    I like crewbum's answer, but this is probably faster (sorry, I haven't tested it yet, but it avoids sorting everything):

    df.groupby('obj_id').agg(lambda df: df.values[df['data_date'].values.argmax()])
    

    It uses NumPy's "argmax" function to find the row index at which the maximum appears.
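
    Note that in recent pandas versions, `agg` calls the function once per column, so a lambda that indexes into the whole sub-frame no longer works there; `apply` is the method that receives each group as a sub-DataFrame. A hedged modern sketch of the same argmax idea (using `idxmax`, the label-based equivalent) might look like:

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'obj_id': [4, 2, 4],
        'data_date': pd.to_datetime(['2011-11-01', '2011-10-01', '2010-07-31']),
        'value': [59500, 35200, 24860],
    })

    # apply hands each group to the lambda as a sub-DataFrame; idxmax plays
    # the role of argmax here, returning the index label of the latest date.
    latest = df.groupby('obj_id').apply(
        lambda g: g.loc[g['data_date'].idxmax(), ['data_date', 'value']]
    )
    ```

    The result is indexed by `obj_id`, with the latest date and its value per group.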

  • 2020-12-01 11:40

    Updating thetainted1's answer, since some of the functions now raise future warnings, as tommy.carstensen pointed out. Here's what worked for me:

    sorted = df.sort_values(by='data_date')
    
    result = sorted.drop_duplicates('obj_id', keep='last')
    
  • 2020-12-01 11:43

    The aggregate() method on groupby objects can be used to create a new DataFrame from a groupby object in a single step. (I'm not aware of a cleaner way to extract the first/last row of a DataFrame though.)

    In [12]: df.groupby('obj_id').agg(lambda df: df.sort('data_date')[-1:].values[0])
    Out[12]: 
             data_date  value
    obj_id                   
    1       2009-07-28  15860
    2       2011-10-01  35200
    4       2011-11-01  59500
    

    You can also perform aggregation on individual columns, in which case the aggregate function works on a Series object.

    In [25]: df.groupby('obj_id')['value'].agg({'diff': lambda s: s.max() - s.min()})
    Out[25]: 
              diff
    obj_id        
    1            0
    2       165000
    4        34640
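
    (The dict-of-lambdas form shown above was removed from Series aggregation in pandas 1.0; a sketch of the named-aggregation equivalent in modern pandas, assuming the question's sample values, would be:)

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'obj_id': [4, 2, 4],
        'value': [59500, 35200, 24860],
    })

    # Named aggregation: the keyword becomes the output column name,
    # the value is the function applied to each group's Series.
    out = df.groupby('obj_id')['value'].agg(diff=lambda s: s.max() - s.min())
    ```

    For this frame, obj_id 4 gets 59500 - 24860 = 34640 and obj_id 2, having a single row, gets 0.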
    