How to implement SQL COALESCE in pandas

Happy的楠姐 2020-12-10 11:54

I have a data frame like this:

df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
     A     B   C
0  1.0   NaN   5
1  2.0  10.0  10
2  NaN   NaN   7


        
5 Answers
  • 2020-12-10 12:38

    Another way is to explicitly fill column D from A, B, and C, in that order.

    df['D'] = np.nan
    df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
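SQL's COALESCE often ends with a literal fallback, as in COALESCE(A, B, 0). A minimal sketch of the same idea with a fillna chain follows. The two-column frame and the default of 0 are assumptions made for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 has no value in either column.
df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan]})

# Start from A, fall back to B, then to the scalar default 0.
df["D"] = df["A"].fillna(df["B"]).fillna(0)
print(df)
```

Starting the chain from df["A"] also avoids creating an all-NaN column D first.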
    
  • 2020-12-10 12:38

    Another approach is to use the combine_first method of a pd.Series. Using your example df,

    >>> import pandas as pd
    >>> import numpy as np
    >>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
    >>> df
         A     B   C
    0  1.0   NaN   5
    1  2.0  10.0  10
    2  NaN   NaN   7
    

    we have

    >>> df.A.combine_first(df.B).combine_first(df.C)
    0    1.0
    1    2.0
    2    7.0
    Name: A, dtype: float64
    

    We can use reduce to abstract this pattern to work with an arbitrary number of columns.

    >>> from functools import reduce
    >>> cols = [df[c] for c in df.columns]
    >>> reduce(lambda acc, col: acc.combine_first(col), cols)
    0    1.0
    1    2.0
    2    7.0
    Name: A, dtype: float64
    

    Let's put this all together in a function.

    >>> def coalesce(*args):
    ...     return reduce(lambda acc, col: acc.combine_first(col), args)
    ...
    >>> coalesce(*cols)
    0    1.0
    1    2.0
    2    7.0
    Name: A, dtype: float64
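The reduce pattern above can also be driven by column names, which makes the priority order explicit. The helper name coalesce_columns and the B, A, C ordering below are hypothetical, chosen only to show that the first listed column wins.

```python
from functools import reduce

import numpy as np
import pandas as pd


def coalesce_columns(frame, names):
    """Return the first non-null value per row, scanning `names` in order."""
    if not names:
        raise ValueError("need at least one column name")
    return reduce(lambda acc, col: acc.combine_first(col), (frame[n] for n in names))


df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# B has priority here, so row 1 takes 10 from B instead of 2 from A.
df["D"] = coalesce_columns(df, ["B", "A", "C"])
print(df)
```

The returned Series keeps the name of the first column scanned. Assigning it to df["D"] stores it under the new name.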
    
  • 2020-12-10 12:40

    I think you need bfill along the rows (axis=1), then select the first column with iloc:

    df['D'] = df.bfill(axis=1).iloc[:,0]
    print (df)
         A     B   C    D
    0  1.0   NaN   5  1.0
    1  2.0  10.0  10  2.0
    2  NaN   NaN   7  7.0
    

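    This is the same as the following (the method argument of fillna is deprecated since pandas 2.1, so prefer bfill on recent versions):

    df['D'] = df.fillna(method='bfill', axis=1).iloc[:,0]
    print(df)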
         A     B   C    D
    0  1.0   NaN   5  1.0
    1  2.0  10.0  10  2.0
    2  NaN   NaN   7  7.0
    
  • 2020-12-10 12:52

    There is already a method for Series in Pandas that does this:

    df['D'] = df['A'].combine_first(df['C'])
    

    Or chain the calls if you want to fall back through several columns in order:

    df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
    

    This outputs the following:

    >>> df
         A     B   C    D
    0  1.0   NaN   5  1.0
    1  2.0  10.0  10  2.0
    2  NaN   NaN   7  7.0
    
  • 2020-12-10 12:53

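    Option 1: pandas (DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this runs only on older versions)

    df.assign(D=df.lookup(df.index, df.isnull().idxmin(axis=1)))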
    
         A     B   C    D
    0  1.0   NaN   5  1.0
    1  2.0  10.0  10  2.0
    2  NaN   NaN   7  7.0
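On pandas versions without DataFrame.lookup, the same label-based lookup can be approximated by converting the column labels to positions. This is a sketch of one possible substitute, not the original answer's code. The names first_valid, col_pos, and out are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# Label of the first non-null column in each row (False sorts before True).
first_valid = df.isnull().idxmin(axis=1)

# Translate those labels into integer column positions.
col_pos = df.columns.get_indexer(first_valid)

# Pick one value per row with NumPy integer indexing.
values = df.to_numpy()[np.arange(len(df)), col_pos]
out = df.assign(D=values)
print(out)
```

For a row that is entirely null, idxmin returns the first column, so the result for that row stays NaN.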
    

    Option 2: numpy

    v = df.values
    j = np.isnan(v).argmin(1)
    df.assign(D=v[np.arange(len(v)), j])
    
         A     B   C    D
    0  1.0   NaN   5  1.0
    1  2.0  10.0  10  2.0
    2  NaN   NaN   7  7.0
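np.isnan only accepts numeric arrays, so the NumPy option raises a TypeError on object columns such as strings. A hedged variant that swaps in pd.isna for the null mask is sketched below. The string-valued frame is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string values and None as the missing marker.
df = pd.DataFrame({
    "A": ["x", None, None],
    "B": [None, "y", None],
    "C": ["p", "q", "r"],
})

v = df.to_numpy()
# pd.isna handles object data; argmin finds the first False (non-null) per row.
j = pd.isna(df).to_numpy().argmin(axis=1)
out = df.assign(D=v[np.arange(len(v)), j])
print(out)
```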
    

    Naive time tests were run over the given data and over larger data (timing output not preserved).
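A naive timing comparison along these lines could be sketched as follows. The function names, the choice of three approaches, and the "larger data" (the example rows repeated 10,000 times) are assumptions. No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})
# Hypothetical "larger data": the same three rows repeated 10,000 times.
big = pd.concat([df] * 10_000, ignore_index=True)


def via_bfill(frame):
    # Back-fill across columns, then keep the first column.
    return frame.bfill(axis=1).iloc[:, 0]


def via_combine_first(frame):
    # Fold combine_first over the columns from left to right.
    return reduce(lambda acc, col: acc.combine_first(col), [frame[c] for c in frame.columns])


def via_numpy(frame):
    # Index the first non-NaN position of each row directly in NumPy.
    v = frame.to_numpy(dtype=float)
    j = np.isnan(v).argmin(axis=1)
    return pd.Series(v[np.arange(len(v)), j], index=frame.index)


for name, data in [("given", df), ("larger", big)]:
    for func in (via_bfill, via_combine_first, via_numpy):
        seconds = timeit.timeit(lambda: func(data), number=5)
        print(f"{name:>6} {func.__name__:<18} {seconds / 5:.6f} s per call")
```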
