Pandas rank by multiple columns

小鲜肉 2020-12-16 19:05

I am trying to rank a pandas DataFrame based on two columns. I can rank it based on one column, but how can I rank it based on two columns? 'SaleCount', then 'TotalReve

5 Answers
  • 2020-12-16 19:12

    (One general way to rank two nonnegative int columns is, as per Nickil Maveli's answer, to cast them to string, concatenate them and cast back to int. Note that this only orders correctly when the second column is zero-padded to a fixed width.)

    However, here's a shortcut if you know that TotalRevenue is constrained to some range, e.g. 0 <= TotalRevenue < MAX_REVENUE with MAX_REVENUE = 100,000: combine the two columns directly as nonnegative integers.

    MAX_REVENUE = 100_000  # must be strictly greater than any TotalRevenue value
    df['Rank'] = (df['SaleCount']*MAX_REVENUE + df['TotalRevenue']).rank(method='dense', ascending=False).astype(int)
    
    df.sort_values('Rank')
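To see the shortcut end to end, here is a minimal sketch on the sample data that appears in the other answers. It assumes both columns hold nonnegative integers. Deriving the multiplier from the data instead of hard-coding a bound is an illustrative choice, not part of the answer above.

```python
import pandas as pd

# Sample data reconstructed from the output tables shown in the other answers.
df = pd.DataFrame({
    "shops": ["S3", "S2", "S1", "S5", "S4", "S8", "S6", "S7", "S9", "S10"],
    "SaleCount": [10, 100, 30, 35, 20, 100, 0, 30, 2, 20],
    "TotalRevenue": [300, 9000, 1000, 750, 500, 2000, 0, 600, 50, 500],
})

# Assumption: the multiplier only has to be strictly greater than the largest
# TotalRevenue, so that no revenue value can spill into the SaleCount part.
multiplier = int(df["TotalRevenue"].max()) + 1

# One integer key per row: SaleCount dominates, TotalRevenue breaks ties.
key = df["SaleCount"] * multiplier + df["TotalRevenue"]
df["Rank"] = key.rank(method="dense", ascending=False).astype(int)

print(df.sort_values("Rank"))
```

The combined key has to fit in a 64-bit integer, so this is only suitable when both columns are reasonably small.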
    
  • 2020-12-16 19:18

    sort_values + GroupBy.ngroup

    This will give the dense ranking.

    The DataFrame should be sorted by these columns in the desired order before the groupby. Specifying sort=False within the groupby then respects this sorting, so that groups are labeled in the order in which they appear within the sorted DataFrame.

    cols = ['SaleCount', 'TotalRevenue']
    df['Rank'] = df.sort_values(cols, ascending=False).groupby(cols, sort=False).ngroup() + 1
    

    Output:

    print(df.sort_values('Rank'))
    
       TotalRevenue        Date  SaleCount shops  Rank
    1          9000  2016-12-02        100    S2     1
    5          2000  2016-12-02        100    S8     2
    3           750  2016-12-02         35    S5     3
    2          1000  2016-12-02         30    S1     4
    7           600  2016-12-02         30    S7     5
    4           500  2016-12-02         20    S4     6
    9           500  2016-12-02         20   S10     6
    0           300  2016-12-02         10    S3     7
    8            50  2016-12-02          2    S9     8
    6             0  2016-12-02          0    S6     9
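The role of sort=False can be checked with a small sketch that numbers the groups both ways on the same sample data. The names `ordered`, `by_appearance` and `by_key` are illustrative only.

```python
import pandas as pd

# Sample data reconstructed from the output table above.
df = pd.DataFrame({
    "shops": ["S3", "S2", "S1", "S5", "S4", "S8", "S6", "S7", "S9", "S10"],
    "SaleCount": [10, 100, 30, 35, 20, 100, 0, 30, 2, 20],
    "TotalRevenue": [300, 9000, 1000, 750, 500, 2000, 0, 600, 50, 500],
})

cols = ["SaleCount", "TotalRevenue"]
ordered = df.sort_values(cols, ascending=False)

# sort=False: groups are numbered in order of first appearance in `ordered`,
# i.e. in descending order of (SaleCount, TotalRevenue).
by_appearance = ordered.groupby(cols, sort=False).ngroup() + 1

# sort=True (the default): groups are numbered by ascending key instead,
# which discards the descending order prepared by sort_values.
by_key = ordered.groupby(cols).ngroup() + 1

print(pd.DataFrame({"sort=False": by_appearance, "sort=True": by_key}))
```

Both results are indexed like `ordered`, so assigning either one back to `df` aligns on the index labels rather than on position.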
    
  • 2020-12-16 19:25

    The generic way to do that is to group the desired fields into a tuple per row, whatever their types.

    df["Rank"] = df[["SaleCount","TotalRevenue"]].apply(tuple,axis=1)\
                 .rank(method='dense',ascending=False).astype(int)
    
    df.sort_values("Rank")
    
       TotalRevenue        Date  SaleCount shops  Rank
    1          9000  2016-12-02        100    S2     1
    5          2000  2016-12-02        100    S8     2
    3           750  2016-12-02         35    S5     3
    2          1000  2016-12-02         30    S1     4
    7           600  2016-12-02         30    S7     5
    4           500  2016-12-02         20    S4     6
    9           500  2016-12-02         20   S10     6
    0           300  2016-12-02         10    S3     7
    8            50  2016-12-02          2    S9     8
    6             0  2016-12-02          0    S6     9
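The tuple approach works because Python compares tuples element by element. As a rough illustration, the same dense ranking can be reproduced with plain Python sorting. This sketch deliberately avoids Series.rank, and the helper names `pairs`, `distinct_desc` and `rank_of` are made up for the example.

```python
import pandas as pd

# Sample data reconstructed from the output table above.
df = pd.DataFrame({
    "shops": ["S3", "S2", "S1", "S5", "S4", "S8", "S6", "S7", "S9", "S10"],
    "SaleCount": [10, 100, 30, 35, 20, 100, 0, 30, 2, 20],
    "TotalRevenue": [300, 9000, 1000, 750, 500, 2000, 0, 600, 50, 500],
})

# One (SaleCount, TotalRevenue) tuple per row.
pairs = list(zip(df["SaleCount"].tolist(), df["TotalRevenue"].tolist()))

# Tuples compare lexicographically: SaleCount decides first,
# TotalRevenue only breaks ties.
distinct_desc = sorted(set(pairs), reverse=True)

# Dense rank: position of each pair among the distinct pairs, starting at 1.
rank_of = {pair: pos for pos, pair in enumerate(distinct_desc, start=1)}
df["Rank"] = [rank_of[pair] for pair in pairs]

print(df.sort_values("Rank"))
```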
    
  • 2020-12-16 19:29

    pd.factorize assigns an integer code to each unique element of an iterable, numbering the elements in order of first appearance. We only need to sort in the order we'd like, then factorize. To handle multiple columns, we convert each row of the sorted result to a tuple.

    cols = ['SaleCount', 'TotalRevenue']
    tups = df[cols].sort_values(cols, ascending=False).apply(tuple, 1)
    f, i = pd.factorize(tups)
    factorized = pd.Series(f + 1, tups.index)
    
    df.assign(Rank=factorized).sort_values('Rank')
    
             Date  SaleCount  TotalRevenue shops  Rank
    1  2016-12-02        100          9000    S2     1
    5  2016-12-02        100          2000    S8     2
    3  2016-12-02         35           750    S5     3
    2  2016-12-02         30          1000    S1     4
    7  2016-12-02         30           600    S7     5
    4  2016-12-02         20           500    S4     6
    9  2016-12-02         20           500   S10     6
    0  2016-12-02         10           300    S3     7
    8  2016-12-02          2            50    S9     8
    6  2016-12-02          0             0    S6     9
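For readers unfamiliar with pd.factorize, a tiny sketch on a toy Series (not the question's data) shows how codes follow the order of first appearance, which is why sorting first turns the codes into a dense rank.

```python
import pandas as pd

# Toy input: "b" appears first, so it gets code 0 even though "a" < "b".
values = pd.Series(["b", "a", "b", "c"])
codes, uniques = pd.factorize(values)

print(codes.tolist())    # [0, 1, 0, 2]
print(uniques.tolist())  # ['b', 'a', 'c']
```

Equal elements always share a code, so after sorting in descending order, `codes + 1` matches a dense descending rank.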
    
  • 2020-12-16 19:33

    Another way is to cast both columns of interest to str and concatenate them, then convert the result back to a number so that the combined values can be compared by magnitude. For that comparison to be valid, the second column has to be zero-padded to a fixed width; otherwise (30, 1000) becomes 301000 and outranks (35, 750), which becomes 35750.

    With method='dense', tied values share the same rank and no rank is skipped after a tie (here the two rows with SaleCount 20 and TotalRevenue 500 both get rank 6).

    Since you want to rank in descending order, pass ascending=False to Series.rank().

    col1 = df["SaleCount"].astype(str)
    col2 = df["TotalRevenue"].astype(str)
    # Zero-pad the second column to a common width so the concatenated numbers order correctly.
    col2 = col2.str.zfill(int(col2.str.len().max()))
    df['Rank'] = (col1+col2).astype(int).rank(method='dense', ascending=False).astype(int)
    df.sort_values('Rank')
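A minimal sketch comparing unpadded and zero-padded concatenation on two rows taken from the sample data. The variable names are illustrative, and str.zfill is just one way to pad.

```python
import pandas as pd

# Two rows from the sample data: the first should outrank the second
# because 35 > 30 in SaleCount.
df = pd.DataFrame({
    "SaleCount": [35, 30],
    "TotalRevenue": [750, 1000],
})

sale = df["SaleCount"].astype(str)
revenue = df["TotalRevenue"].astype(str)

# Unpadded: "35" + "750" -> 35750, "30" + "1000" -> 301000,
# so the row with the smaller SaleCount gets the larger key.
unpadded = (sale + revenue).astype(int)

# Padded to the widest revenue: "35" + "0750" -> 350750, "30" + "1000" -> 301000.
width = int(revenue.str.len().max())
padded = (sale + revenue.str.zfill(width)).astype(int)

print(unpadded.tolist())  # [35750, 301000]
print(padded.tolist())    # [350750, 301000]
```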
    
