Efficiently replace values from a column to another column Pandas DataFrame

温柔的废话 2020-12-08 04:52

I have a Pandas DataFrame like this:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3     0  0.4  0.4
4     0    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I want to replace the zeros in col1 with the corresponding value in col2 and, where the value is still zero after that, with the value in col3. My current approach uses replace, but it is slow on a large DataFrame. Is there a more efficient way to do this?

3 Answers
  • 2020-12-08 05:30

    I'm not sure if it's faster, but you're right that you can slice the dataframe to get your desired result.

    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    print(df)
    

    Output:

       col1  col2  col3
    0   0.2   0.3   0.3
    1   0.2   0.3   0.3
    2   0.4   0.4   0.4
    3   0.3   0.0   0.3
    4   0.0   0.0   0.0
    5   0.1   0.4   0.4
    

    Alternatively, if you want it to be more terse (though I don't know whether it's faster), you can combine what you did with what I did.

    df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
    print(df)
    

    Output:

       col1  col2  col3
    0   0.2   0.3   0.3
    1   0.2   0.3   0.3
    2   0.4   0.4   0.4
    3   0.3   0.0   0.3
    4   0.0   0.0   0.0
    5   0.1   0.4   0.4
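
    Note that the chained indexing pattern df.col1[df.col1 == 0] = ... can raise SettingWithCopyWarning and may not modify the DataFrame at all under copy-on-write in newer pandas. An equivalent assignment using .loc (my rewording, not part of the original answer) would be:

    df.loc[df.col1 == 0, 'col1'] = df.col2
    df.loc[df.col1 == 0, 'col1'] = df.col3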
    
  • 2020-12-08 05:39

    An approach using pd.DataFrame.where and pd.DataFrame.bfill:

    # assumes numpy is imported as np: replace zeros with NaN,
    # back-fill along each row, then take col1 and restore zeros
    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    df
    

    Another approach using np.argmax

    def pir2(df):
        # index of the first non-zero column in each row
        slc = (df.values != 0).argmax(axis=1)
        # pick that element from each row
        return df.values[np.arange(slc.shape[0]), slc]
    

    I know there is a better way to use numpy to slice. I just can't think of it at the moment.
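
    One possible version of that slicing uses np.take_along_axis; this variant is my addition rather than part of the original answer:

    def pir2_take(df):
        vals = df.values
        # index of the first non-zero column in each row
        slc = (vals != 0).argmax(axis=1)
        # gather that element from each row in a single vectorized call
        return np.take_along_axis(vals, slc[:, None], axis=1).ravel()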

  • 2020-12-08 05:45

    Using np.where is faster. Following a similar pattern to the one you used with replace:

    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
    

    However, using a nested np.where is slightly faster:

    df['col1'] = np.where(df['col1'] == 0, 
                          np.where(df['col2'] == 0, df['col3'], df['col2']),
                          df['col1'])
    

    Timings
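
    The functions below assume numpy, pandas, and the question's 6-row example DataFrame are already in scope; a rough reconstruction of that setup (my sketch, inferred from the example in the question) is:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'col1': [0.2, 0.2, 0, 0, 0, 0.1],
                       'col2': [0.3, 0.3, 0.4, 0, 0, 0.4],
                       'col3': [0.3, 0.3, 0.4, 0.3, 0, 0.4]})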

    Using the following setup to produce a larger sample DataFrame and timing functions:

    df = pd.concat([df]*10**4, ignore_index=True)
    
    def root_nested(df):
        df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
        return df
    
    def root_split(df):
        df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
        df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
        return df
    
    def pir2(df):
        df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
        return df
    
    def pir2_2(df):
        slc = (df.values != 0).argmax(axis=1)
        return df.values[np.arange(slc.shape[0]), slc]
    
    def andrew(df):
        df.col1[df.col1 == 0] = df.col2
        df.col1[df.col1 == 0] = df.col3
        return df
    
    def pablo(df):
        df['col1'] = df['col1'].replace(0,df['col2'])
        df['col1'] = df['col1'].replace(0,df['col3'])
        return df
    

    I get the following timings:

    %timeit root_nested(df.copy())
    100 loops, best of 3: 2.25 ms per loop
    
    %timeit root_split(df.copy())
    100 loops, best of 3: 2.62 ms per loop
    
    %timeit pir2(df.copy())
    100 loops, best of 3: 6.25 ms per loop
    
    %timeit pir2_2(df.copy())
    1 loop, best of 3: 2.4 ms per loop
    
    %timeit andrew(df.copy())
    100 loops, best of 3: 8.55 ms per loop
    

    I tried timing your replace-based method (pablo above), but it ran for multiple minutes without completing. For comparison, timing it on just the 6-row example DataFrame (not the much larger one tested above) took 12.8 ms.
