Pandas report top-n in group and pivot

Submitted by 自作多情 on 2019-12-22 08:51:40

Question


I am trying to summarise a dataframe by grouping along a single dimension d1 and reporting summary statistics for each element of d1. In particular, I am interested in the top n (index and values) for a number of metrics. What I would like to produce is one row for each element of d1.

Say I have two dimensions, d1 and d2, and four metrics: m1, m2, m3 and m4.

1) What is the suggested way of grouping by d1 and finding the top n d2 values, with the corresponding metric value, for each of the metrics m1 to m4?

In Wes's book, Python for Data Analysis, he suggests (page 35):

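# As quoted from the book. Newer pandas releases spell the sort as
# group.sort_values('births', ascending=False).
def get_top1000(group):
    return group.sort_index(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)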

Is that still the recommended way? (I am only interested in, say, the top 5 d2 values out of thousands, and for multiple metrics.)

2) The next problem is that I want to pivot the top 5, so that I have one row for each element of d1.

For dimensions d1 and d2 and metric m1, the resulting data frame should be indexed by d1, with columns for the top 5 values of d2 and the corresponding values of m1:

d1 d2-1 d2-2 d2-3 d2-4 d2-5 m1-1 m1-2 m1-3 m1-4 m1-5

....

To pivot, I have to create a ranking along d2 (1 to 5), which becomes my columns field. This would be easy if I always had 5 entries, but occasionally there are fewer than 5 elements of d2 for a given value of d1.

Could someone suggest how to add a ranking to the grouping, so that I have the correct column index to perform the pivoting?


Answer 1:


I don't have any toy data to use or expected results to compare to, but I think you want the following:

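N = 1000
names = my_fake_data_loader()  # placeholder: load the data however you normally do
grouped = names.groupby(['year', 'sex'])
# sort_values replaces the older sort_index(by=...) spelling
grouped.apply(lambda g: g.sort_values('births', ascending=False).head(N))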

That will give you the top 1000 rows of each group, ordered by births from largest to smallest.
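For the ranking and pivoting part of the question, something along these lines might work. This is only a sketch on made-up toy data: the column names d1, d2 and m1 follow the question, while the helper name top_n_wide and its parameters are hypothetical. It uses groupby(...).cumcount() to number the rows within each group, so groups with fewer than n entries simply end up with NaN in the missing rank columns.

```python
import pandas as pd

# Toy data, invented for illustration: group "b" has only two d2 values.
df = pd.DataFrame({
    "d1": ["a", "a", "a", "a", "b", "b"],
    "d2": ["p", "q", "r", "s", "p", "q"],
    "m1": [10, 40, 30, 20, 5, 7],
})


def top_n_wide(frame, metric, n, group="d1", label="d2"):
    """Return one row per group holding its top-n labels and metric values."""
    # Sort once by the metric, then keep the first n rows of every group.
    top = (
        frame.sort_values(metric, ascending=False)
        .groupby(group)
        .head(n)
        .copy()
    )
    # Rank within each group: 1 for the largest value, 2 for the next, ...
    top["rank"] = top.groupby(group).cumcount() + 1
    # Pivot so each rank becomes a column; short groups are padded with NaN.
    wide = top.pivot(index=group, columns="rank", values=[label, metric])
    # Flatten the (field, rank) column MultiIndex into names like "d2-1".
    wide.columns = [f"{field}-{rank}" for field, rank in wide.columns]
    return wide


result = top_n_wide(df, "m1", n=3)
print(result)
```

To cover several metrics, one option is to call the helper once per metric and join the results on d1, renaming the d2 columns per metric so they do not collide.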



Source: https://stackoverflow.com/questions/26309178/pandas-report-top-n-in-group-and-pivot
