How do I get the percentile for a row in a pandas dataframe?

既然无缘 2020-12-30 08:48
Example DataFrame values:

0     78
1     38
2     42
3     48
4     31
5     89
6     94
7    102
8    122
9    122  

stats.percentileofscore(temp['INCOME'].v


        
3 Answers
  • 2020-12-30 09:24

    TL;DR

    Use

    sz = temp['INCOME'].size-1
    temp['PCNT_LIN'] = temp['INCOME'].rank(method='max').apply(lambda x: 100.0*(x-1)/sz)
    
       INCOME    PCNT_LIN
    0      78   44.444444
    1      38   11.111111
    2      42   22.222222
    3      48   33.333333
    4      31    0.000000
    5      89   55.555556
    6      94   66.666667
    7     102   77.777778
    8     122  100.000000
    9     122  100.000000
    

    Answer

    It is actually very simple once you understand the mechanics. When you are looking for the percentile of a score, you already have the scores in each row. The only step left is to recognize that you need the percentage of values that are less than or equal to the selected value. This is exactly what kind='weak' in scipy.stats.percentileofscore() and method='max' (with pct=True) in Series.rank() compute. To invert it, run Series.quantile() with interpolation='lower'.

    So the behavior of scipy.stats.percentileofscore(), Series.rank() and Series.quantile() is consistent, as shown below:

    In[]:
    import pandas as pd
    import scipy.stats

    temp = pd.DataFrame([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], columns=['INCOME'])
    temp['PCNT_RANK'] = temp['INCOME'].rank(method='max', pct=True)
    temp['POF'] = temp['INCOME'].apply(lambda x: scipy.stats.percentileofscore(temp['INCOME'], x, kind='weak'))
    temp['QUANTILE_VALUE'] = temp['PCNT_RANK'].apply(lambda x: temp['INCOME'].quantile(x, interpolation='lower'))
    temp['RANK'] = temp['INCOME'].rank(method='max')
    sz = temp['RANK'].size - 1
    temp['PCNT_LIN'] = temp['RANK'].apply(lambda x: (x - 1) / sz)
    temp['CHK'] = temp['PCNT_LIN'].apply(lambda x: temp['INCOME'].quantile(x))
    
    temp
    
    Out[]:
       INCOME  PCNT_RANK    POF  QUANTILE_VALUE  RANK  PCNT_LIN    CHK
    0      78        0.5   50.0              78   5.0  0.444444   78.0
    1      38        0.2   20.0              38   2.0  0.111111   38.0
    2      42        0.3   30.0              42   3.0  0.222222   42.0
    3      48        0.4   40.0              48   4.0  0.333333   48.0
    4      31        0.1   10.0              31   1.0  0.000000   31.0
    5      89        0.6   60.0              89   6.0  0.555556   89.0
    6      94        0.7   70.0              94   7.0  0.666667   94.0
    7     102        0.8   80.0             102   8.0  0.777778  102.0
    8     122        1.0  100.0             122  10.0  1.000000  122.0
    9     122        1.0  100.0             122  10.0  1.000000  122.0
    

    The PCNT_RANK column now holds the ratio of values that are less than or equal to the value in the INCOME column. If you want the "interpolated" ratio instead, it is in the PCNT_LIN column. Because the calculation relies on Series.rank(), it is pretty fast and will crunch 255M numbers in seconds.
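    As a rough cross-check, the two ratios can be recomputed without SciPy. This is only a sketch: the names weak_by_hand, weak_by_rank and lin_by_rank are made up for illustration and do not appear in the answer above.

```python
import numpy as np
import pandas as pd

income = pd.Series([78, 38, 42, 48, 31, 89, 94, 102, 122, 122], name="INCOME")
n = income.size

# "Weak" ratio: share of values less than or equal to each value.
weak_by_hand = income.apply(lambda x: (income <= x).sum() / n)
weak_by_rank = income.rank(method="max", pct=True)

# "Linear" ratio: position on the 0..1 scale that quantile() interpolates over.
lin_by_rank = (income.rank(method="max") - 1) / (n - 1)

# The hand count and the rank-based column should agree row by row.
assert np.allclose(weak_by_hand, weak_by_rank)

print(pd.DataFrame({"INCOME": income, "WEAK": weak_by_rank, "LIN": lin_by_rank}))
```

    The vectorized form of the linear ratio avoids the per-row apply(), which may matter on very large columns.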


    Here I will explain how you get the value from using quantile() with linear interpolation:

    temp['INCOME'].quantile(0.11)
    37.93
    

    Our data, temp['INCOME'], has only ten values. According to the formula from your link to Wiki, the rank of the 11th percentile is

    rank = 11*(10-1)/100 + 1 = 1.99
    

    The integer part of the rank is 1, which corresponds to the value 31, and the value with rank 2 (i.e. the next bin) is 38. The interpolation fraction is the fractional part of the rank, 0.99. This leads to the result:

     31 + (38-31)*(0.99) = 37.93
    

    For the values themselves, the fractional part has to be zero, so it is very easy to do the inverse calculation and get the percentile:

    p = (rank - 1)*100/(10 - 1)
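
    The arithmetic above can be replayed in a few lines. The helpers quantile_by_hand and percentile_of_rank are hypothetical names introduced only for this sketch; they mirror the rank formula as described here and are not part of pandas.

```python
import math

import numpy as np
import pandas as pd

income = pd.Series([78, 38, 42, 48, 31, 89, 94, 102, 122, 122])


def quantile_by_hand(values, q):
    """Linear-interpolation quantile using the 1-based rank formula above."""
    ordered = sorted(values)
    n = len(ordered)
    rank = q * (n - 1) + 1      # 1-based, possibly fractional rank
    lower = math.floor(rank)    # integer part of the rank
    frac = rank - lower         # fractional part of the rank
    if lower >= n:              # q == 1.0: there is no next value to move toward
        return float(ordered[-1])
    return ordered[lower - 1] + (ordered[lower] - ordered[lower - 1]) * frac


def percentile_of_rank(rank, n):
    """Inverse calculation for a value that sits exactly on an integer rank."""
    return (rank - 1) * 100 / (n - 1)


by_hand = quantile_by_hand(income, 0.11)

# Both numbers should be 37.93 up to floating-point rounding.
print(by_hand, income.quantile(0.11))

# Percentile of the value with rank 2 (38) among ten values.
print(percentile_of_rank(2, len(income)))
```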
    

    I hope that makes it clearer.

  • 2020-12-30 09:24

    This seems to work:

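    import numpy as np
    A = np.sort(temp['INCOME'].values)  # sorted scores
    # `sample` holds the value(s) to look up; the result is a fraction in [0, 1]
    np.interp(sample, A, np.linspace(0, 1, len(A)))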
    

    For example:

    >>> temp.INCOME.quantile(np.interp([37.5, 38, 122, 121], A, np.linspace(0, 1, len(A))))
    0.103175     37.5
    0.111111     38.0
    1.000000    122.0
    0.883333    121.0
    Name: INCOME, dtype: float64
    

    Please note that this strategy only makes sense if you want to query a large enough number of values. Otherwise the sorting is too expensive.
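
    A self-contained sketch of the same idea follows, with the sort done once and reused for every lookup. The function name interpolated_percentile and the sample values are assumptions made for illustration. The sample deliberately avoids the tied value 122, because ties make the sorted array non-strictly increasing and np.interp expects increasing x-coordinates.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({"INCOME": [78, 38, 42, 48, 31, 89, 94, 102, 122, 122]})

# Sort once; every later lookup reuses these two arrays.
sorted_income = np.sort(temp["INCOME"].to_numpy())
grid = np.linspace(0, 1, len(sorted_income))


def interpolated_percentile(sample):
    """Map score(s) to a fraction in [0, 1] by linear interpolation."""
    return np.interp(sample, sorted_income, grid)


sample = [37.5, 38, 121]
fractions = interpolated_percentile(sample)

# Feeding the fractions back into quantile() should recover the sample.
round_trip = temp["INCOME"].quantile(fractions).to_numpy()
print(fractions)
print(round_trip)
```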

  • 2020-12-30 09:24

    Consider the following DataFrame:

    [image: DataFrame]

    To get the percentage (relative frequency) of each value in a column of a pandas DataFrame, use the following code:

     survey['Nationality'].value_counts(normalize=True)
    

    Output:

    USA          0.333333
    China        0.250000
    India        0.250000
    Bangadesh    0.166667
    Name: Nationality, dtype: float64

    To get the percentage breakdown of one column within each category of another categorical column:

    pd.crosstab(survey.Sex, survey.Handedness, normalize='index')
    

    The output would look something like this:

    [image: Output]
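
    Since the original DataFrame is not shown, here is a minimal sketch with made-up data. The contents of survey below are hypothetical; only the column names Nationality, Sex and Handedness come from the code above.

```python
import pandas as pd

# Hypothetical survey data; the original DataFrame was not preserved.
survey = pd.DataFrame({
    "Nationality": ["USA", "USA", "China", "India", "USA", "China"],
    "Sex": ["F", "M", "F", "M", "M", "F"],
    "Handedness": ["Right", "Left", "Right", "Right", "Right", "Left"],
})

# Share of each nationality among all rows (the shares sum to 1).
shares = survey["Nationality"].value_counts(normalize=True)

# Share of each handedness within each sex (each row sums to 1).
by_sex = pd.crosstab(survey["Sex"], survey["Handedness"], normalize="index")

print(shares)
print(by_sex)
```

    Multiply either result by 100 if you want percentages rather than fractions.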
