How to get column name for second largest row value in pandas DataFrame

Submitted by 旧巷老猫 on 2020-02-15 07:17:14

Question


I have a pretty simple question - I think - but it seems I can't wrap my head around this one. I am a beginner with Python and Pandas. I searched the forum but couldn't get a (recent) answer that fits my need.

I have a data frame such as this one:

import pandas as pd
df = pd.DataFrame({'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]}, index=['a1', 'a2', 'a3'])

Which gives:

      A   B    C   D
a1  1.1   2  3.3   4
a2  2.7  10  5.4   7
a3  5.3   9  1.5  15

My question is simple: I would like to add a column that gives, for each row, the name of the column holding the second-largest value.

I have written a simple function which returns the second max value for each row:

def get_second_best(x):
    return sorted(x)[-2]

df['value'] = df.apply(lambda row: get_second_best(row), axis=1)

Which gives:

      A   B    C   D  value
a1  1.1   2  3.3   4    3.3
a2  2.7  10  5.4   7    7.0
a3  5.3   9  1.5  15    9.0

But I can't find how to display the column name in the 'value' column, instead of the value... I'm thinking about boolean indexing (comparing the 'value' column values with each row), but I haven't figured out how to do it.

To be clearer, I would like it to be:

      A   B    C   D  value
a1  1.1   2  3.3   4    C
a2  2.7  10  5.4   7    D
a3  5.3   9  1.5  15    B

Any help (and explanation) appreciated!


Answer 1:


One approach would be to pick out the two largest elements in each row using Series.nlargest and find the column corresponding to the smaller of the two using Series.idxmin:

In [45]: df['value'] = df.T.apply(lambda x: x.nlargest(2).idxmin())

In [46]: df
Out[46]:
      A   B    C   D value
a1  1.1   2  3.3   4     C
a2  2.7  10  5.4   7     D
a3  5.3   9  1.5  15     B
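
To see why this works, it may help to trace a single row by hand. The sketch below rebuilds the example frame from the question and applies the two steps separately to row a1. The variable names row, top_two and label are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
    index=['a1', 'a2', 'a3'],
)

# Each column of df.T is one row of df, as a Series indexed by column name.
row = df.loc['a1']

# Step 1: keep only the two largest entries of the row.
top_two = row.nlargest(2)

# Step 2: the smaller of those two is the second-largest value overall,
# and idxmin returns its label rather than the value itself.
label = top_two.idxmin()
print(label)

# The same two steps applied to every row at once.
result = df.T.apply(lambda x: x.nlargest(2).idxmin())
print(result.tolist())
```

For row a1 the two largest entries are D and C, so the label returned is C, matching the output above.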

It is worth noting that picking Series.idxmin over DataFrame.idxmin can make a difference performance-wise:

df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 39.8 ms ± 2.66 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 53.6 ms ± 362 µs

Edit: Adding to @jpp's answer, if performance matters, you can gain a significant speed-up by using Numba, writing the code as if this were C and compiling it:

import numpy as np
from numba import njit, prange

@njit
def arg_second_largest(arr):
    args = np.empty(len(arr), dtype=np.int_)
    for k in range(len(arr)):
        a = arr[k]
        second = -np.inf  # np.NINF was removed in NumPy 2.0
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args

Let's compare the different solutions on two sets of data with shapes (1000, 4) and (1000, 1000) respectively:

df = pd.DataFrame(np.random.normal(size=(1000, 4)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 429 ms ± 5.1 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]          # 94.7 µs ± 2.15 µs
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 101 µs ± 1.07 µs
%timeit df.columns[arg_second_largest(df.values)]        # 74.1 µs ± 775 ns

df = pd.DataFrame(np.random.normal(size=(1000, 1000)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 1.8 s ± 49.7 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]          # 52.1 ms ± 1.44 ms
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 14.6 ms ± 145 µs
%timeit df.columns[arg_second_largest(df.values)]        # 1.11 ms ± 22.6 µs
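
The argpartition line in the benchmark is not explained, so here is a minimal sketch of why it should return the same labels as the full argsort. np.argpartition(values, -2) only guarantees that position -2 of each row holds the index a full sort would put there; the order of the remaining entries is unspecified, which is why it can be cheaper on wide frames. The array reuses the values from the question, and the names via_argsort and via_argpartition are only illustrative.

```python
import numpy as np

values = np.array([
    [1.1, 2.0, 3.3, 4.0],
    [2.7, 10.0, 5.4, 7.0],
    [5.3, 9.0, 1.5, 15.0],
])
columns = np.array(['A', 'B', 'C', 'D'])

# Full sort of the indices along each row, then take the second-to-last one.
via_argsort = values.argsort(axis=1)[:, -2]

# Partial sort: only the index at position -2 is guaranteed to be the one
# a full sort would place there.
via_argpartition = np.argpartition(values, -2, axis=1)[:, -2]

print(columns[via_argsort])
print(columns[via_argpartition])
```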

For the Numba version on the (1000, 1000) data, I was able to squeeze out a bit more and get the benchmark down to 852 µs by using @njit(parallel=True) and replacing the outer loop with for k in prange(len(arr)).
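
The parallel variant described above is not shown, so what follows is a sketch of what it might look like, reconstructed from the description rather than taken from the original benchmark. The name arg_second_largest_parallel is hypothetical, and the try/except fallback to plain Python is an addition so that the snippet still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption: without Numba, fall back to an uncompiled, serial version.
    prange = range

    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit(parallel=True)
def arg_second_largest_parallel(arr):
    args = np.empty(len(arr), dtype=np.int_)
    # Rows are independent of each other, so the outer loop can be split
    # across threads with prange.
    for k in prange(len(arr)):
        a = arr[k]
        second = -np.inf
        arg_second = 0
        first = -np.inf
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args


rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 50))
result = arg_second_largest_parallel(data)

# With no ties in the data, this should agree with the argsort approach.
print(np.array_equal(result, data.argsort(axis=1)[:, -2]))
```

Whether the parallel version actually helps will depend on the shape of the data and the number of cores; for narrow frames the threading overhead may outweigh the gain.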




Answer 2:


Here's one solution using NumPy. The idea is to argsort the values in your dataframe along each row, select the second-to-last column of the result, and finally use this to index df.columns.

df['value'] = df.columns[df.values.argsort(1)[:, -2]]

print(df)

      A   B    C   D value
a1  1.1   2  3.3   4     C
a2  2.7  10  5.4   7     D
a3  5.3   9  1.5  15     B
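
The one-liner packs three steps together. Spelled out on the question's data, they look roughly like this; the intermediate names order and second_idx are only for illustration. Note that the frame is rebuilt from scratch here: if a 'value' column had already been added, df.values would include it.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
    index=['a1', 'a2', 'a3'],
)

# Step 1: for every row, the column positions ordered from the smallest
# value to the largest.
order = df.values.argsort(axis=1)
print(order)

# Step 2: the second-to-last position in that ordering belongs to the
# second-largest value of the row.
second_idx = order[:, -2]

# Step 3: map the integer positions back to column labels.
df['value'] = df.columns[second_idx]
print(df['value'].tolist())
```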

You should find this more efficient than Pandas-based solutions:

# Python 3.6, NumPy 1.14.3, Pandas 0.23.0

np.random.seed(0)

df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])

%timeit df.T.apply(lambda x: x.nlargest(2).idxmin())  # 49.6 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin()  # 73.2 ms
%timeit df.columns[df.values.argsort(1)[:, -2]]       # 36.3 µs


Source: https://stackoverflow.com/questions/52464947/how-to-get-column-name-for-second-largest-row-value-in-pandas-dataframe
