I have the following dataframe:
obj_id data_date value
0 4 2011-11-01 59500
1 2 2011-10-01 35200
2 4 2010-07-31 24860
This is another possible solution. I believe it's the fastest.
df.loc[df.groupby('obj_id').data_date.idxmax(),:]
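For reference, here is a minimal, self-contained version of that one-liner against the sample data (the dtypes are my assumption; parsing data_date as real datetimes keeps the comparisons safe):

import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'obj_id': [4, 2, 4],
    'data_date': pd.to_datetime(['2011-11-01', '2011-10-01', '2010-07-31']),
    'value': [59500, 35200, 24860],
})

# idxmax() returns, per obj_id, the index label of the row with the latest date;
# .loc then pulls those complete rows out of the original frame.
print(df.loc[df.groupby('obj_id')['data_date'].idxmax()])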
I believe I've found a more appropriate solution based on the ones in this thread. Mine uses a DataFrame's apply function instead of aggregate, and it returns, for each CARD_NO, the most recent DATE.
import pandas as pd

df = pd.DataFrame({
    'CARD_NO': ['000', '001', '002', '002', '001', '111'],
    'DATE': ['2006-12-31 20:11:39', '2006-12-27 20:11:53', '2006-12-28 20:12:11',
             '2006-12-28 20:12:13', '2008-12-27 20:11:53', '2006-12-30 20:11:39']})
print(df)
df.groupby('CARD_NO').apply(lambda g: g['DATE'].values[g['DATE'].values.argmax()])
Original
CARD_NO DATE
0 000 2006-12-31 20:11:39
1 001 2006-12-27 20:11:53
2 002 2006-12-28 20:12:11
3 002 2006-12-28 20:12:13
4 001 2008-12-27 20:11:53
5 111 2006-12-30 20:11:39
Returned (a Series with the most recent DATE per CARD_NO):
CARD_NO
000 2006-12-31 20:11:39
001 2008-12-27 20:11:53
002 2006-12-28 20:12:13
111 2006-12-30 20:11:39
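If you want the whole rows back rather than just the dates, the same argmax idea can drive an iloc lookup per group. A sketch (converting DATE to real datetimes first is my addition, so string comparison quirks can't bite):

df['DATE'] = pd.to_datetime(df['DATE'])
# argmax finds the positional index of the latest DATE within each group;
# iloc with a list keeps each result as a one-row frame.
full_rows = df.groupby('CARD_NO', group_keys=False).apply(
    lambda g: g.iloc[[g['DATE'].values.argmax()]])
print(full_rows)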
If the number of "obj_id"s is very high, you'll want to sort the entire dataframe and then drop duplicates, keeping the last row for each obj_id.
sorted = df.sort_index(by='data_date')
result = sorted.drop_duplicates('obj_id', keep='last').values
This should be faster (sorry, I didn't test it) because you don't have to use a custom agg function, which is slow when there are a large number of keys. You might think it's worse to sort the entire dataframe, but in practice sorts are fast (they run in native code) while Python-level loops are slow.
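If you want to check that speed claim yourself, here is a rough benchmark sketch (the sizes, column names, and timeit harness are my assumptions; numbers will vary with your data):

import numpy as np
import pandas as pd
from timeit import timeit

# Many rows spread over many obj_id groups, to stress the per-group work.
n = 100_000
big = pd.DataFrame({
    'obj_id': np.random.randint(0, 10_000, n),
    'data_date': pd.Timestamp('2010-01-01')
                 + pd.to_timedelta(np.random.randint(0, 1000, n), unit='D'),
    'value': np.random.randint(0, 100_000, n),
})

t_sort = timeit(lambda: big.sort_values('data_date')
                           .drop_duplicates('obj_id', keep='last'), number=3)
t_apply = timeit(lambda: big.groupby('obj_id')
                            .apply(lambda g: g.loc[g['data_date'].idxmax()]), number=3)
print(f'sort+drop_duplicates: {t_sort:.3f}s, per-group apply: {t_apply:.3f}s')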
I like crewbum's answer; this is probably faster (sorry, I haven't tested it yet, but it avoids sorting everything):
df.groupby('obj_id').agg(lambda g: g.values[g['data_date'].values.argmax()])
It uses NumPy's argmax function to find the index of the row in which the maximum appears.
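Note that in current pandas, agg applies the function to each column separately, so a lambda that expects the whole group frame no longer works there; apply is the closer modern equivalent (a sketch under that assumption):

# Same argmax trick via apply, which does receive each group as a DataFrame.
result = df.groupby('obj_id').apply(lambda g: g.iloc[g['data_date'].values.argmax()])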
Updating thetainted1's answer, since some of those functions now raise future warnings, as tommy.carstensen pointed out. Here's what worked for me:
sorted_df = df.sort_values(by='data_date')
result = sorted_df.drop_duplicates('obj_id', keep='last')
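An equivalent one-liner, if you prefer it (a sketch; sorting newest-first lets drop_duplicates keep its default first occurrence per obj_id):

result = df.sort_values('data_date', ascending=False).drop_duplicates('obj_id')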
The aggregate() method can build a new DataFrame from a groupby object in a single step. (I'm not aware of a cleaner way to extract the first/last row of a DataFrame, though.)
In [12]: df.groupby('obj_id').agg(lambda g: g.sort_values('data_date')[-1:].values[0])
Out[12]:
data_date value
obj_id
1 2009-07-28 15860
2 2011-10-01 35200
4 2011-11-01 59500
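Since .sort() was later removed and agg became column-wise, a present-day way to get the same table (a sketch, not part of the original answer) is to sort and take each group's last row:

result = df.sort_values('data_date').groupby('obj_id').last()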
You can also perform aggregation on individual columns, in which case the aggregate function works on a Series object.
In [25]: df.groupby('obj_id')['value'].agg(diff=lambda s: s.max() - s.min())
Out[25]:
diff
obj_id
1 0
2 165000
4 34640
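The keyword form used in In [25] is pandas' named aggregation (0.25+); it scales to several statistics at once (a sketch; the extra aggregate names are mine):

stats = df.groupby('obj_id')['value'].agg(
    diff=lambda s: s.max() - s.min(),
    lo='min',
    hi='max',
)
print(stats)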