问题
Suppose I have pandas DataFrame like this:
>>> df = pd.DataFrame({\'id\':[1,1,1,2,2,2,2,3,4],\'value\':[1,2,3,1,2,3,4,1,1]})
>>> df
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after group by:
>>> dfN = df.groupby(\'id\').apply(lambda x:x[\'value\'].reset_index()).reset_index()
>>> dfN
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
>>> dfN[dfN[\'level_1\'] <= 1][[\'id\', \'value\']]
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there more effective/elegant approach to do this? And also is there more elegant approach to number records within each group (like SQL window function row_number()).
回答1:
Did you try df.groupby('id').head(2)
Ouput generated:
>>> df.groupby('id').head(2)
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True)
to remove the multindex and flatten the results.
>>> df.groupby('id').head(2).reset_index(drop=True)
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
回答2:
Since 0.14.1, you can now do nlargest
and nsmallest
on a groupby
object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True)
to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series
and SeriesGroupBy
.)
回答3:
Sometimes sorting the whole data ahead is very time consuming. We can groupby first and doing topk for each group:
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)
来源:https://stackoverflow.com/questions/20069009/pandas-get-topmost-n-records-within-each-group