Pandas rank by multiple columns

孤者浪人 提交于 2019-11-30 14:56:26

Another way would be to type-cast both the columns of interest to str and combine them by concatenating them. Convert these back to numerical values so that they could be differentiated based on their magnitude.

In method=dense, ranks of duplicated values would remain unchanged. (Here: 6)

Since you want to rank these in their descending order, specifying ascending=False in Series.rank() would let you achieve the desired result.

col1 = df["SaleCount"].astype(str) 
col2 = df["TotalRevenue"].astype(str)
df['Rank'] = (col1+col2).astype(int).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank')

pd.factorize will generate unique values for each unique element of a iterable. We only need to sort in the order we'd like, then factorize. In order to do multiple columns, we convert the sorted result to tuples.

cols = ['SaleCount', 'TotalRevenue']
tups = df[cols].sort_values(cols, ascending=False).apply(tuple, 1)
f, i = pd.factorize(tups)
factorized = pd.Series(f + 1, tups.index)

df.assign(Rank=factorized)

         Date  SaleCount  TotalRevenue shops  Rank
1  2016-12-02        100          9000    S2     1
5  2016-12-02        100          2000    S8     2
3  2016-12-02         35           750    S5     3
2  2016-12-02         30          1000    S1     4
7  2016-12-02         30           600    S7     5
4  2016-12-02         20           500    S4     6
9  2016-12-02         20           500   S10     6
0  2016-12-02         10           300    S3     7
8  2016-12-02          2            50    S9     8
6  2016-12-02          0             0    S6     9

(The correct way to rank two (nonnegative) int columns is as per Nickil Maveli's answer, to cast them to string, concatenate them and cast back to int.)

However here's a shortcut if you know that TotalRevenue is constrained to some range e.g. 0 to MAX_REVENUE=100,000 ; directly manipulate them as nonnegative integers:

df['Rank'] = (df['SaleCount']*MAX_REVENUE + df['TotalRevenue']).rank(method='dense', ascending=False).astype(int)

df.sort_values('Rank2')

The generic way to do that is to group the desired fiels in a tuple, whatever the types.

df["Rank"] = df[["SaleCount","TotalRevenue"]].apply(tuple,axis=1)\
             .rank(method='dense',ascending=False).astype(int)

df.sort_values("Rank")

   TotalRevenue        Date  SaleCount shops  Rank
1          9000  2016-12-02        100    S2     1
5          2000  2016-12-02        100    S8     2
3           750  2016-12-02         35    S5     3
2          1000  2016-12-02         30    S1     4
7           600  2016-12-02         30    S7     5
4           500  2016-12-02         20    S4     6
9           500  2016-12-02         20   S10     6
0           300  2016-12-02         10    S3     7
8            50  2016-12-02          2    S9     8
6             0  2016-12-02          0    S6     9
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!