How to get column name for second largest row value in pandas DataFrame

刺人心 2020-12-20 15:47

I have a pretty simple question - I think - but it seems I can't wrap my head around this one. I am a beginner with Python and Pandas. I searched the forum but couldn't ge

2 Answers
  • 2020-12-20 16:22

    Here's one solution using NumPy. The idea is to argsort the values in each row of your dataframe, take the second-to-last column of the resulting index array (the position of each row's second-largest value), and use that to index df.columns.

    df['value'] = df.columns[df.values.argsort(1)[:, -2]]
    
    print(df)
    
          A   B    C   D value
    a1  1.1   2  3.3   4     C
    a2  2.7  10  5.4   7     D
    a3  5.3   9  1.5  15     B
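
    For anyone who wants to run the one-liner end to end, here is a minimal, self-contained sketch. The input frame is reconstructed from the printed output above (an assumption, since the original question's data is not shown here), and the intermediate names `values`, `order` and `second_pos` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the printed output (assumed input data).
df = pd.DataFrame(
    {"A": [1.1, 2.7, 5.3], "B": [2, 10, 9], "C": [3.3, 5.4, 1.5], "D": [4, 7, 15]},
    index=["a1", "a2", "a3"],
)

# Take the values BEFORE adding the new column, so only numeric data is sorted.
values = df.to_numpy()

# argsort along axis=1 gives, per row, the column positions in ascending order of value.
order = values.argsort(axis=1)

# The second-to-last entry of each row is the position of the second-largest value.
second_pos = order[:, -2]

# Map positions back to column labels.
df["value"] = df.columns[second_pos]
print(df)
```

    Note that the column labels are looked up before the new column is assigned, so `value` itself never takes part in the sort.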
    

    You should find this more efficient than Pandas-based solutions:

    # Python 3.6, NumPy 1.14.3, Pandas 0.23.0
    import numpy as np
    import pandas as pd
    
    np.random.seed(0)
    
    df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
    
    %timeit df.T.apply(lambda x: x.nlargest(2).idxmin())  # 49.6 ms
    %timeit df.T.apply(lambda x: x.nlargest(2)).idxmin()  # 73.2 ms
    %timeit df.columns[df.values.argsort(1)[:, -2]]       # 36.3 µs
    
  • 2020-12-20 16:23

    One approach would be to pick out the two largest elements in each row using Series.nlargest and find the column corresponding to the smallest of those using Series.idxmin:

    In [45]: df['value'] = df.T.apply(lambda x: x.nlargest(2).idxmin())
    
    In [46]: df
    Out[46]:
          A   B    C   D value
    a1  1.1   2  3.3   4     C
    a2  2.7  10  5.4   7     D
    a3  5.3   9  1.5  15     B
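
    The same idea can also be written without the transpose by applying the function row-wise with axis=1. This is only a sketch: the input frame is reconstructed from the output above (an assumption), and `second_largest_label` is a hypothetical helper name, not something from pandas.

```python
import pandas as pd

# Sample frame reconstructed from the printed output (assumed input data).
df = pd.DataFrame(
    {"A": [1.1, 2.7, 5.3], "B": [2, 10, 9], "C": [3.3, 5.4, 1.5], "D": [4, 7, 15]},
    index=["a1", "a2", "a3"],
)

def second_largest_label(row):
    """Hypothetical helper: label of the second-largest value in one row."""
    # nlargest(2) keeps the two largest entries; the smaller of those two
    # is the second-largest value, and idxmin returns its label.
    return row.nlargest(2).idxmin()

# axis=1 passes each row to the helper as a Series indexed by column name.
result = df.apply(second_largest_label, axis=1)
print(result)
```

    Computing `result` first and assigning it afterwards keeps the new column out of the rows being ranked.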
    

    It is worth noting that picking Series.idxmin over DataFrame.idxmin can make a difference performance-wise:

    df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
    %timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 39.8 ms ± 2.66 ms
    %timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 53.6 ms ± 362 µs
    

    Edit: Adding to @jpp's answer, if performance matters, you can gain a significant speed-up by using Numba, writing the code as if this were C and compiling it:

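    import numpy as np
    from numba import njit, prange
    
    @njit
    def arg_second_largest(arr):
        # For each row of a 2-D array, find the position of the second-largest value.
        args = np.empty(len(arr), dtype=np.int_)
        for k in range(len(arr)):
            a = arr[k]
            # -np.inf is used instead of np.NINF, which was removed in NumPy 2.0.
            second = -np.inf
            arg_second = 0
            first = -np.inf
            arg_first = 0
            for i in range(len(a)):
                x = a[i]
                if x >= first:
                    # New maximum: the previous maximum becomes the runner-up.
                    second = first
                    first = x
                    arg_second = arg_first
                    arg_first = i
                elif x >= second:
                    # New runner-up only.
                    second = x
                    arg_second = i
            args[k] = arg_second
        return args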
    

    Let's compare the different solutions on two sets of data with shapes (1000, 4) and (1000, 1000) respectively:

    df = pd.DataFrame(np.random.normal(size=(1000, 4)))
    %timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 429 ms ± 5.1 ms
    %timeit df.columns[df.values.argsort(1)[:, -2]]          # 94.7 µs ± 2.15 µs
    %timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 101 µs ± 1.07 µs
    %timeit df.columns[arg_second_largest(df.values)]        # 74.1 µs ± 775 ns
    
    df = pd.DataFrame(np.random.normal(size=(1000, 1000)))
    %timeit df.T.apply(lambda x: x.nlargest(2).idxmin())     # 1.8 s ± 49.7 ms
    %timeit df.columns[df.values.argsort(1)[:, -2]]          # 52.1 ms ± 1.44 ms
    %timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 14.6 ms ± 145 µs
    %timeit df.columns[arg_second_largest(df.values)]        # 1.11 ms ± 22.6 µs
    

    In the last case, I was able to squeeze out a bit more and get the benchmark down to 852 µs by using @njit(parallel=True) and replacing the outer loop with for k in prange(len(arr)).
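
    The np.argpartition line in the benchmarks is not explained above, so here is a small sketch of why it works. Partitioning around kth=-2 only guarantees that the element at position -2 is the one a full sort would place there, with anything larger after it; that is enough to locate the second-largest value without sorting the whole row. The array shape and the seed below are arbitrary choices for illustration, and the check assumes there are no ties within a row.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed, for reproducibility only
values = rng.normal(size=(1000, 50))     # continuous data, so ties are not expected

# Full sort: position -2 of each row's argsort is the second-largest value.
by_sort = values.argsort(axis=1)[:, -2]

# Partial sort: only position -2 is guaranteed to be in its final sorted place.
by_partition = np.argpartition(values, -2, axis=1)[:, -2]

# Both approaches point at the same column in every row.
print(np.array_equal(by_sort, by_partition))
```

    The partial sort does less work per row than a full sort, which is consistent with the gap between the two in the (1000, 1000) timings above.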
