Find percentile in pandas dataframe based on groups

眉间皱痕 提交于 2021-01-28 09:45:48

问题


Season  Name    value
2001    arkansas    3.497
2002    arkansas    3.0935
2003    arkansas    3.3625
2015    arkansas    3.766
2001    colorado    2.21925
2002    colorado    1.4795
2010    colorado    2.89175
2011    colorado    2.48825
2012    colorado    2.08475
2013    colorado    1.68125
2014    colorado    2.5555
2015    colorado    2.48825

In the dataframe above, I want to identify top and bottom 10 percentile values in column value for each state (arkansas and colorado). How do I do that? I can identify top and bottom percentile for entire value column like so:

np.searchsorted(np.percentile(a, [10, 90]), a))

回答1:


You can use groupby + quantile:

df.groupby('Name')['value'].quantile([.1, .9])

Name
arkansas  0.1    3.174200
          0.9    3.685300
colorado  0.1    1.620725
          0.9    2.656375
Name: value, dtype: float64

And then call np.searchsorted.


Alternatively, use qcut.

df.groupby('Name').apply(lambda x: 
       pd.qcut(x['value'], [.1, .9]))

Name
arkansas  0     (3.173, 3.685]
          1                NaN
          2     (3.173, 3.685]
          3                NaN
colorado  4      (1.62, 2.656]
          5                NaN
          6                NaN
          7      (1.62, 2.656]
          8      (1.62, 2.656]
          9      (1.62, 2.656]
          10     (1.62, 2.656]
          11     (1.62, 2.656]
Name: value, dtype: object



回答2:


If the variable for your dataframe is df, this should work. I'm not sure what you want your output to look like, but I just created code for a dictionary, where each key is a state. Also, since you have very few values, I used the option "nearest" for the argument interpolation (the default value is interpolation). To see the possible options, check out the documentation for the function here.

import pandas as pd
import numpy as np
df = pd.read_csv('stacktest.csv')
#array of unique state names from the dataframe
states = np.unique(df['Name'])
#empty dictionary 
state_data = dict()
for state in states:
    state_data[state] = np.percentile(df[df['Name'] == state]['value'],[10,90],interpolation = 'nearest')
print(state_data)



来源:https://stackoverflow.com/questions/47007073/find-percentile-in-pandas-dataframe-based-on-groups

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!