问题
I have the following sample CSV.
,cid1,cid2,cid3
rid1,0.1,0.4,0.3
rid2,1.0,0.1,0.5
rid3,0.2,0.5,0.1
rid4,0.3,0.4,0.8
rid5,0.2,0.3,0.7
rid6,0.9,0.2,0.1
rid7,0.4,0.8,0.9
rid8,0.6,0.5,0.7
rid9,0.3,0.9,0.4
I want to show n rows with the highest value for every column in the file. Ideally, I would like to get the following output (for n = 3).
cid1 rid2 1.0
cid1 rid6 0.9
cid1 rid8 0.6
# Blank lines separating columns are optional.
cid2 rid9 0.9
cid2 rid7 0.8
cid2 rid8 0.5
cid3 rid7 0.9
cid3 rid4 0.8
cid3 rid8 0.7
This is what I have so far.
import pandas as pd
df = pd.read_csv("input.csv", index_col=0)
n = 3
for col in df:
print(col)
print(df.sort_values(col, ascending=False)[:n][col].to_string())
which gives me almost what I want.
cid1
rid2 1.0
rid6 0.9
rid8 0.6
cid2
rid9 0.9
rid7 0.8
rid3 0.5
cid3
rid7 0.9
rid4 0.8
rid5 0.7
Also, there is a small issue with cid2, for which rid8 and rid3 have the same value of 0.5. In that case, it would be perfect to show them both in separate lines, but I'm aware it will probably complicate the code much further so I would be happy with the basic solution.
回答1:
Are you looking for GroupBy.nlargest?
df.unstack().groupby(level=0, group_keys=False).nlargest(3)
cid1 rid2 1.0
rid6 0.9
rid8 0.6
cid2 rid9 0.9
rid7 0.8
rid3 0.5
cid3 rid7 0.9
rid4 0.8
rid5 0.7
dtype: float64
Trying to mirror OP's "expected" result:
print(df.unstack()
.groupby(level=0, group_keys=False)
.nlargest(3)
.reset_index()
.to_string(header=None, index=False))
cid1 rid2 1.0
cid1 rid6 0.9
cid1 rid8 0.6
cid2 rid9 0.9
cid2 rid7 0.8
cid2 rid3 0.5
cid3 rid7 0.9
cid3 rid4 0.8
cid3 rid5 0.7
来源:https://stackoverflow.com/questions/65438075/show-top-n-rows-for-every-column-in-pandas-data-frame