Group by pandas DataFrame and select the latest row in each group

不打扰是莪最后的温柔 submitted on 2019-11-27 04:12:18

Use idxmax within groupby to find the row label of the latest date in each group, then slice df with loc:

df.loc[df.groupby('id').date.idxmax()]

    id  product       date
2  220     6647 2014-10-16
5  826     3380 2015-05-19
8  901     4555 2014-11-01
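
As a self-contained sketch of the idxmax approach: only rows 2, 5 and 8 below come from the output shown above; the earlier rows in each group are invented filler so the example has something to discard.

```python
import pandas as pd

# Rows 2, 5 and 8 match the output above; the other rows are made-up filler.
df = pd.DataFrame({
    "id": [220] * 3 + [826] * 3 + [901] * 3,
    "product": [6647] * 3 + [3380] * 3 + [4555] * 3,
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

# For each id, idxmax returns the index label of the row with the latest date.
latest_idx = df.groupby("id")["date"].idxmax()

# loc then selects those rows from the original frame.
latest = df.loc[latest_idx]
print(latest)
```

Because idxmax returns index labels, this relies on df having a unique index; no prior sorting is needed.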

You can also use tail with groupby to get the last n values of the group:

df.sort_values('date').groupby('id').tail(1)

    id  product       date
2  220     6647 2014-10-16
8  901     4555 2014-11-01
5  826     3380 2015-05-19
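
A small sketch of the "last n" variant, again on partly invented sample data (only rows 2, 5 and 8 appear in the output above). The rows come back in the order of the sorted frame, which is why the result is ordered by date rather than by id.

```python
import pandas as pd

# Rows 2, 5 and 8 match the output above; the other rows are made-up filler.
df = pd.DataFrame({
    "id": [220] * 3 + [826] * 3 + [901] * 3,
    "product": [6647] * 3 + [3380] * 3 + [4555] * 3,
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

by_date = df.sort_values("date")

# tail(n) keeps the last n rows of every group and preserves the original index.
last_one = by_date.groupby("id").tail(1)
last_two = by_date.groupby("id").tail(2)

print(last_one)
print(last_two)
```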

To apply .tail() to each group and keep the group key as an outer index level:

df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))

        id  product       date
id
220 2  220     6647 2014-10-16
826 5  826     3380 2015-05-19
901 8  901     4555 2014-11-01
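
The two-level index produced by apply can be inspected and flattened as in the sketch below. Two assumptions to flag: the sample rows other than 2, 5 and 8 are invented, and the non-key columns are selected explicitly before apply. That selection is my addition, not part of the answer; it is there because pandas versions differ on whether apply passes the grouping column to the function, so the id column shown above may or may not appear depending on the version.

```python
import pandas as pd

# Rows 2, 5 and 8 match the output above; the other rows are made-up filler.
df = pd.DataFrame({
    "id": [220] * 3 + [826] * 3 + [901] * 3,
    "product": [6647] * 3 + [3380] * 3 + [4555] * 3,
    "date": pd.to_datetime([
        "2014-09-01", "2014-09-03", "2014-10-16",
        "2014-11-11", "2014-12-09", "2015-05-19",
        "2014-09-01", "2014-10-05", "2014-11-01",
    ]),
})

# Select the non-key columns explicitly so the result is the same whether or
# not the installed pandas passes the grouping column to the applied function.
per_group = (
    df.sort_values("date")
      .groupby("id")[["product", "date"]]
      .apply(lambda g: g.tail(1))
)

# Outer level: the group key (id). Inner level: the original row label.
print(per_group.index.tolist())

# Dropping the outer level recovers the flat, original index.
flat = per_group.droplevel(0)
print(flat)
```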

I had a similar problem and ended up using drop_duplicates rather than groupby.

It seems to run significantly faster on large datasets than the other methods suggested above.

df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")

    id  product        date
2  220     6647  2014-10-16
8  901     4555  2014-11-01
5  826     3380  2015-05-19
Sandu Ursu
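
The speed claim for drop_duplicates is easy to check on your own data. The sketch below is one way to do that, not a benchmark result: the synthetic frame, its size, and the helper names via_drop_duplicates and via_idxmax are all assumptions made for illustration, and timings will vary with pandas version and data shape.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (hypothetical sizes): 200,000 rows spread over up to 20,000 ids.
# A permutation of second offsets gives every row a distinct timestamp, so
# "latest per id" is unambiguous and both methods must agree.
rng = np.random.default_rng(0)
n = 200_000
big = pd.DataFrame({
    "id": rng.integers(0, 20_000, size=n),
    "date": pd.Timestamp("2014-01-01")
            + pd.to_timedelta(rng.permutation(n), unit="s"),
})


def via_drop_duplicates(frame):
    # Sort by date, then keep the last (latest) row seen for each id.
    return frame.sort_values("date").drop_duplicates(subset=["id"], keep="last")


def via_idxmax(frame):
    # Look up the index label of the latest date per id, then select those rows.
    return frame.loc[frame.groupby("id")["date"].idxmax()]


# Confirm the two approaches select the same rows before timing them.
a = via_drop_duplicates(big).sort_index()
b = via_idxmax(big).sort_index()
same = a.equals(b)

t_dd = timeit.timeit(lambda: via_drop_duplicates(big), number=3)
t_idx = timeit.timeit(lambda: via_idxmax(big), number=3)
print(f"same rows: {same}")
print(f"drop_duplicates: {t_dd:.3f}s  idxmax: {t_idx:.3f}s")
```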

Given a dataframe sorted by date, you can obtain what you ask for in a number of ways:

Like this:

df.groupby(['id','product']).last()

Like this:

df.groupby(['id','product']).nth(-1)

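Or like this (max takes the maximum of each column independently and ignores row order, so it matches the latest row only because date is the sole non-key column here):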

df.groupby(['id','product']).max()

If you don't want id and product to appear as the index, use groupby(['id', 'product'], as_index=False). Alternatively, use tail, which keeps the original row index:

df.groupby(['id','product']).tail(1)
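
The variants above differ mainly in the shape of the result, and max() differs in meaning once there is more than one non-key column. The sketch below illustrates this on a small invented frame; the qty column is hypothetical and is not part of the original data.

```python
import pandas as pd

# Invented sample data; "qty" is a hypothetical extra column added to show
# how max() differs from last().
df = pd.DataFrame({
    "id": [220, 220, 826, 826],
    "product": [6647, 6647, 3380, 3380],
    "date": pd.to_datetime(
        ["2014-09-01", "2014-10-16", "2014-11-11", "2015-05-19"]
    ),
    "qty": [9, 1, 7, 2],
}).sort_values("date")

keys = ["id", "product"]

# last(): one row per group, with the group keys as a MultiIndex.
last_rows = df.groupby(keys).last()

# as_index=False keeps the group keys as ordinary columns.
flat_last = df.groupby(keys, as_index=False).last()

# tail(1): the actual last row of each group, with the original index labels.
tail_rows = df.groupby(keys).tail(1)

# max(): per-column maximum, so qty no longer comes from the latest row.
col_max = df.groupby(keys).max()

print(last_rows)
print(col_max)
```

Here last() reports qty 1 and 2 (the values on the latest dates), while max() reports 9 and 7, which belong to earlier rows.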