I have a pretty simple question - I think - but it seems I can't wrap my head around this one. I am a beginner with Python and Pandas. I searched the forum but couldn't find an answer: for each row of my DataFrame, I want to get the name of the column that holds the second largest value in that row.
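A small example frame to illustrate (the values match the output shown in the answers below):

import pandas as pd

# Sample data; columns B and D are integers, A and C are floats.
df = pd.DataFrame(
    {'A': [1.1, 2.7, 5.3], 'B': [2, 10, 9], 'C': [3.3, 5.4, 1.5], 'D': [4, 7, 15]},
    index=['a1', 'a2', 'a3'],
)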
Here's one solution using NumPy. The idea is to argsort the values in your dataframe, select the second-to-last column, and finally use this to index df.columns.
df['value'] = df.columns[df.values.argsort(1)[:, -2]]
print(df)
A B C D value
a1 1.1 2 3.3 4 C
a2 2.7 10 5.4 7 D
a3 5.3 9 1.5 15 B
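To see what the indexing does on this sample, here is a rough breakdown (the values shown in the comments are illustrative):

# Select only the numeric columns ('value' is excluded since it was just added).
# argsort(1) gives, for each row, the column positions in ascending order of
# value, so [:, -2] picks the position of the second largest value.
idx = df[['A', 'B', 'C', 'D']].values.argsort(1)
second = idx[:, -2]        # array([2, 3, 1]) -> positions of C, D and B
df.columns[second]         # Index(['C', 'D', 'B'], dtype='object')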
You should find this more efficient than Pandas-based solutions:
# Python 3.6, NumPy 1.14.3, Pandas 0.23.0
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 49.6 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 73.2 ms
%timeit df.columns[df.values.argsort(1)[:, -2]] # 36.3 µs
One approach would be to pick out the two largest elements in each row using Series.nlargest and find the column corresponding to the smallest of those using Series.idxmin:
In [45]: df['value'] = df.T.apply(lambda x: x.nlargest(2).idxmin())
In [46]: df
Out[46]:
A B C D value
a1 1.1 2 3.3 4 C
a2 2.7 10 5.4 7 D
a3 5.3 9 1.5 15 B
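To make the logic concrete, this is what the lambda sees for a single row of the sample frame (indicative output, assuming the same df as above):

row = df.loc['a1', ['A', 'B', 'C', 'D']]
row.nlargest(2)            # D    4.0
                           # C    3.3
row.nlargest(2).idxmin()   # 'C' -> the column holding the second largest value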
It is worth noting that picking Series.idxmin over DataFrame.idxmin can make a difference performance-wise:
df = pd.DataFrame(np.random.normal(size=(100, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 39.8 ms ± 2.66 ms
%timeit df.T.apply(lambda x: x.nlargest(2)).idxmin() # 53.6 ms ± 362 µs
Edit: Adding to @jpp's answer, if performance matters, you can gain a significant speed-up by using Numba, writing the code as if this were C and compiling it:
import numpy as np
from numba import njit, prange

@njit
def arg_second_largest(arr):
    # For each row of a 2D array, return the column index of the second largest value.
    args = np.empty(len(arr), dtype=np.int_)
    for k in range(len(arr)):
        a = arr[k]
        # Track the largest and second largest values seen so far, plus their indices.
        second = np.NINF
        arg_second = 0
        first = np.NINF
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                # New maximum: the old maximum becomes the second largest.
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args
Let's compare the different solutions on two sets of data with shapes (1000, 4) and (1000, 1000), respectively:
df = pd.DataFrame(np.random.normal(size=(1000, 4)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 429 ms ± 5.1 ms
%timeit df.columns[df.values.argsort(1)[:, -2]] # 94.7 µs ± 2.15 µs
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 101 µs ± 1.07 µs
%timeit df.columns[arg_second_largest(df.values)] # 74.1 µs ± 775 ns
df = pd.DataFrame(np.random.normal(size=(1000, 1000)))
%timeit df.T.apply(lambda x: x.nlargest(2).idxmin()) # 1.8 s ± 49.7 ms
%timeit df.columns[df.values.argsort(1)[:, -2]] # 52.1 ms ± 1.44 ms
%timeit df.columns[np.argpartition(df.values, -2)[:,-2]] # 14.6 ms ± 145 µs
%timeit df.columns[arg_second_largest(df.values)] # 1.11 ms ± 22.6 µs
In the last case, I was able to squeeze out a bit more and get the benchmark down to 852 µs by using @njit(parallel=True) and replacing the outer loop with for k in prange(len(arr)).
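For completeness, a minimal sketch of that parallel variant (the function name is mine; the body is identical apart from the decorator and prange):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def arg_second_largest_parallel(arr):
    args = np.empty(len(arr), dtype=np.int_)
    # Rows are independent, so the outer loop can be parallelized with prange.
    for k in prange(len(arr)):
        a = arr[k]
        second = np.NINF
        arg_second = 0
        first = np.NINF
        arg_first = 0
        for i in range(len(a)):
            x = a[i]
            if x >= first:
                second = first
                first = x
                arg_second = arg_first
                arg_first = i
            elif x >= second:
                second = x
                arg_second = i
        args[k] = arg_second
    return args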