Get first non-null value per row


轻奢々 2020-12-20 21:04

I have a sample dataframe, shown below. For each row, I want to check c1 first; if it is null, then check c2, and so on. In this way, find the first non-null column and store

4 answers

無奈伤痛 (original poster)

2020-12-20 21:54

Back-fill the NaNs along each row first (bfill with axis=1), then select the first column with iloc. Rows where every column is null fall back to 'unknown':

df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')

Or:

df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')

print(df)
   ID   c1   c2  c3   c4 result
0   1    a    b   a  NaN      a
1   2  NaN   cc  dd   cc     cc
2   3  NaN   ee  ff   ee     ee
3   4  NaN  NaN  gg   gg     gg
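For anyone who wants to run this end to end, here is a minimal self-contained sketch. The input frame is reconstructed from the printed output above, and the fifth row (ID 5, all null) is a hypothetical addition that is not in the original data; it is there only to show when the 'unknown' fallback applies.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the printed output above (an assumption).
# The last row (ID 5) is hypothetical and added only to show the fallback.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "c1": ["a", np.nan, np.nan, np.nan, np.nan],
    "c2": ["b", "cc", "ee", np.nan, np.nan],
    "c3": ["a", "dd", "ff", "gg", np.nan],
    "c4": [np.nan, "cc", "ee", "gg", np.nan],
})

cols = ["c1", "c2", "c3", "c4"]

# bfill(axis=1) pulls each row's next non-null value leftwards, so the
# first column ends up holding the first non-null value of that row.
filled = df[cols].bfill(axis=1)

# A row with no non-null value at all is still NaN in the first column,
# which is what fillna('unknown') replaces.
df["result"] = filled.iloc[:, 0].fillna("unknown")

print(df)
```

Naming the columns explicitly, as in the first variant, avoids picking up unrelated columns if the frame later gains more of them; the iloc[:, 1:] variant assumes every column after ID is a candidate.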

Performance:

df = pd.concat([df] * 1000, ignore_index=True)


In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.78 ms per loop

In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.7 ms per loop

#jpp solution
In [222]: %%timeit
     ...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
     ...: 
     ...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
     ...: 
1 loop, best of 3: 180 ms per loop

#cᴏʟᴅsᴘᴇᴇᴅ's solution
In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
1 loop, best of 3: 606 ms per loop
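The timings above come from an IPython session. A rough way to repeat the comparison in a plain script is sketched below with the standard timeit module. The helper names with_bfill and with_stack are hypothetical, the sample frame is reconstructed from the printed output, and the stack variant is restricted to c1..c4 here (an assumption on my part) so that the ID column cannot be picked up as the first value. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Sample frame reconstructed from the printed output (an assumption).
base = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "c1": ["a", np.nan, np.nan, np.nan],
    "c2": ["b", "cc", "ee", np.nan],
    "c3": ["a", "dd", "ff", "gg"],
    "c4": [np.nan, "cc", "ee", "gg"],
})
big = pd.concat([base] * 1000, ignore_index=True)
cols = ["c1", "c2", "c3", "c4"]


def with_bfill(frame):
    # Back-fill along rows, then take the first column.
    return frame[cols].bfill(axis=1).iloc[:, 0].fillna("unknown")


def with_stack(frame):
    # Reshape to long form, then take the first non-null value per row.
    return frame[cols].stack().groupby(level=0).first()


for fn in (with_bfill, with_stack):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(big), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1000:.2f} ms per call")
```

Both helpers return the same values on this data, so the comparison is like for like.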
