Merge and update dataframes based on a subset of their columns

梦谈多话 2021-01-20 01:45

Is there a faster way to replace the two for loops, assuming the DataFrames are very large? In my real case, each DataFrame has 200 rows and 25 columns.

3 answers
  •  甜味超标
    2021-01-20 02:39

    Some cleaning:

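    def clean_df(df):
        # Promote the first row to column labels (this mutates the df passed in).
        df.columns = df.iloc[0]
        df.columns.name = None
        # Drop the header row; reset_index() keeps the old labels as an 'index' column.
        df = df.iloc[1:].reset_index()

        return df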
    
    df1 = clean_df(df1)
    df1
       index Name Unit Attribute  Date
    0      1    a    A         1  2014
    1      2    b    B         2  2015
    2      3    c    C         3  2016
    3      4    d    D         4  2017
    4      5    e    E         5  2018
    
    df2 = clean_df(df2)
    df2    
       index Name Unit  Date
    0      1    a    F  2019
    1      2    b    G  2020
    2      3    e    H  2021
    3      4    f    I  2022
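
The cleaning step can also be written so that it leaves the input untouched and does not add an extra index column. The sketch below is only an illustration: the raw frame and the helper name promote_header are hypothetical, and it assumes the real header sits in the first data row, as clean_df above implies.

```python
import pandas as pd

# Hypothetical raw input: the real header is stored in the first data row.
raw = pd.DataFrame([
    ["Name", "Unit", "Date"],
    ["a", "F", 2019],
    ["b", "G", 2020],
])

def promote_header(df):
    """Return a new frame whose first row becomes the column labels."""
    # Drop the header row and renumber the rows without keeping the old labels.
    out = df.iloc[1:].reset_index(drop=True)
    # Use the values of the first row as plain column labels.
    out.columns = df.iloc[0].tolist()
    return out

cleaned = promote_header(raw)
print(cleaned)
```

Unlike clean_df, this variant returns a copy, so raw keeps its original integer column labels.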
    

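    Use merge with on='Name', so rows are matched on Name alone and the other shared columns (Unit, Date) are not used as join keys. Those overlapping columns come back suffixed _x (from df1) and _y (from df2); keep the _y versions, strip the suffix, and fill the rows that had no match in df2 from df1. This relies on Name being unique in df2, so the left merge keeps df1's row order and index.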

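    # Keep df2's Unit/Date (suffixed _y) and df1's Attribute.
    cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
    df1 = (df1.merge(df2, how='left', on='Name')[cols]
              .rename(columns=lambda x: x.split('_')[0])  # strip the _y suffix
              .fillna(df1))  # rows with no match in df2 keep df1's values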
    
    df1
      Name Unit Attribute  Date
    0    a    F         1  2019
    1    b    G         2  2020
    2    c    C         3  2016
    3    d    D         4  2017
    4    e    H         5  2021
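
As an alternative to the merge above, the same result can be sketched with DataFrame.update, which overwrites values in place by aligning on the index. This is not the approach from the answer. It is a minimal sketch that rebuilds the two cleaned frames by hand from the tables shown above and assumes Name is unique in both.

```python
import pandas as pd

# The cleaned frames from above, rebuilt by hand (without the 'index' column).
df1 = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e"],
    "Unit": ["A", "B", "C", "D", "E"],
    "Attribute": [1, 2, 3, 4, 5],
    "Date": [2014, 2015, 2016, 2017, 2018],
})
df2 = pd.DataFrame({
    "Name": ["a", "b", "e", "f"],
    "Unit": ["F", "G", "H", "I"],
    "Date": [2019, 2020, 2021, 2022],
})

# Index both frames by the key so that update() aligns rows on Name.
result = df1.set_index("Name")
# Overwrite the shared columns (Unit, Date) for Names present in df2.
# Names that exist only in df2 ('f') are ignored; Attribute is untouched.
result.update(df2.set_index("Name"))
result = result.reset_index()
print(result)
```

Depending on the pandas version, integer columns may come back as floats after update, so it is worth checking the dtypes afterwards.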
    
