问题
I am trying to rank a pandas data frame based on two columns. I can rank it based on one column, but how can to rank it based on two columns? 'SaleCount', then 'TotalRevenue'?
import pandas as pd
df = pd.DataFrame({'TotalRevenue':[300,9000,1000,750,500,2000,0,600,50,500],
'Date':['2016-12-02' for i in range(10)],
'SaleCount':[10,100,30,35,20,100,0,30,2,20],
'shops':['S3','S2','S1','S5','S4','S8','S6','S7','S9','S10']})
df['Rank'] = df.SaleCount.rank(method='dense',ascending = False).astype(int)
#df['Rank'] = df.TotalRevenue.rank(method='dense',ascending = False).astype(int)
df.sort_values(['Rank'], inplace=True)
print(df)
current output:
Date SaleCount TotalRevenue shops Rank
1 2016-12-02 100 9000 S2 1
5 2016-12-06 100 2000 S8 1
3 2016-12-04 35 750 S5 2
2 2016-12-03 30 1000 S1 3
7 2016-12-08 30 600 S7 3
9 2016-12-10 20 500 S10 4
4 2016-12-05 20 500 S4 4
0 2016-12-01 10 300 S3 5
8 2016-12-09 2 50 S9 6
6 2016-12-07 0 0 S6 7
I'm trying to generate an output like this:
Date SaleCount TotalRevenue shops Rank
1 2016-12-02 100 9000 S2 1
5 2016-12-02 100 2000 S8 2
3 2016-12-02 35 750 S5 3
2 2016-12-02 30 1000 S1 4
7 2016-12-02 30 600 S7 5
9 2016-12-02 20 500 S10 6
4 2016-12-02 20 500 S4 6
0 2016-12-02 10 300 S3 7
8 2016-12-02 2 50 S9 8
6 2016-12-02 0 0 S6 9
回答1:
Another way would be to type-cast both the columns of interest to str
and combine them by concatenating them. Convert these back to numerical values so that they could be differentiated based on their magnitude.
In method=dense
, ranks of duplicated values would remain unchanged. (Here: 6)
Since you want to rank these in their descending order, specifying ascending=False
in Series.rank() would let you achieve the desired result.
col1 = df["SaleCount"].astype(str)
col2 = df["TotalRevenue"].astype(str)
df['Rank'] = (col1+col2).astype(int).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank')
回答2:
pd.factorize
will generate unique values for each unique element of a iterable. We only need to sort in the order we'd like, then factorize. In order to do multiple columns, we convert the sorted result to tuples.
cols = ['SaleCount', 'TotalRevenue']
tups = df[cols].sort_values(cols, ascending=False).apply(tuple, 1)
f, i = pd.factorize(tups)
factorized = pd.Series(f + 1, tups.index)
df.assign(Rank=factorized)
Date SaleCount TotalRevenue shops Rank
1 2016-12-02 100 9000 S2 1
5 2016-12-02 100 2000 S8 2
3 2016-12-02 35 750 S5 3
2 2016-12-02 30 1000 S1 4
7 2016-12-02 30 600 S7 5
4 2016-12-02 20 500 S4 6
9 2016-12-02 20 500 S10 6
0 2016-12-02 10 300 S3 7
8 2016-12-02 2 50 S9 8
6 2016-12-02 0 0 S6 9
回答3:
(The correct way to rank two (nonnegative) int columns is as per Nickil Maveli's answer, to cast them to string, concatenate them and cast back to int.)
However here's a shortcut if you know that TotalRevenue
is constrained to some range e.g. 0 to MAX_REVENUE=100,000 ; directly manipulate them as nonnegative integers:
df['Rank'] = (df['SaleCount']*MAX_REVENUE + df['TotalRevenue']).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank2')
回答4:
The generic way to do that is to group the desired fiels in a tuple, whatever the types.
df["Rank"] = df[["SaleCount","TotalRevenue"]].apply(tuple,axis=1)\
.rank(method='dense',ascending=False).astype(int)
df.sort_values("Rank")
TotalRevenue Date SaleCount shops Rank
1 9000 2016-12-02 100 S2 1
5 2000 2016-12-02 100 S8 2
3 750 2016-12-02 35 S5 3
2 1000 2016-12-02 30 S1 4
7 600 2016-12-02 30 S7 5
4 500 2016-12-02 20 S4 6
9 500 2016-12-02 20 S10 6
0 300 2016-12-02 10 S3 7
8 50 2016-12-02 2 S9 8
6 0 2016-12-02 0 S6 9
回答5:
sort_values
+ GroupBy.ngroup
This will give the dense
ranking.
Columns should be sorted in the desired order prior to the groupby. Specifying sort=False
within the groupby
then respects this sorting so that groups are labeled in the order they appear within the sorted DataFrame.
cols = ['SaleCount', 'TotalRevenue']
df['Rank'] = df.sort_values(cols, ascending=False).groupby(cols, sort=False).ngroup() + 1
Output:
print(df.sort_values('Rank'))
TotalRevenue Date SaleCount shops Rank
1 9000 2016-12-02 100 S2 1
5 2000 2016-12-02 100 S8 2
3 750 2016-12-02 35 S5 3
2 1000 2016-12-02 30 S1 4
7 600 2016-12-02 30 S7 5
4 500 2016-12-02 20 S4 6
9 500 2016-12-02 20 S10 6
0 300 2016-12-02 10 S3 7
8 50 2016-12-02 2 S9 8
6 0 2016-12-02 0 S6 9
来源:https://stackoverflow.com/questions/41974374/pandas-rank-by-multiple-columns