Merge and update dataframes based on a subset of their columns


梦谈多话 2021-01-20 01:45

Is there a faster way to do this that replaces the two for loops, assuming the DataFrames are very large? In my real case, each DataFrame has 200 rows and 25 columns.

3 answers

甜味超标 (original poster)

2021-01-20 02:39

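First, some cleaning, since the header sits in the first data row:

def clean_df(df):
    # Promote the first row to column labels.
    df.columns = df.iloc[0]
    df.columns.name = None
    # Drop the header row; reset_index() keeps the old index as an "index" column.
    df = df.iloc[1:].reset_index()

    return df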

df1 = clean_df(df1)
df1
   index Name Unit Attribute  Date
0      1    a    A         1  2014
1      2    b    B         2  2015
2      3    c    C         3  2016
3      4    d    D         4  2017
4      5    e    E         5  2018

df2 = clean_df(df2)
df2    
   index Name Unit  Date
0      1    a    F  2019
1      2    b    G  2020
2      3    e    H  2021
3      4    f    I  2022

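Use merge with on='Name' so that rows are matched on Name only. The other shared columns (Unit, Date, and the index column added by reset_index) are not used as keys and get _x/_y suffixes instead. Select the _y versions, strip the suffix, and fall back to df1 with fillna where df2 has no match.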

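# Keep df2's Unit and Date (suffix _y) together with df1's Attribute.
cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
# Left-merge on Name, strip the _y suffix, then fill unmatched rows from df1.
df1 = df1.merge(df2, how='left', on='Name')[cols]\
              .rename(columns=lambda x: x.split('_')[0]).fillna(df1)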

df1
  Name Unit Attribute  Date
0    a    F         1  2019
1    b    G         2  2020
2    c    C         3  2016
3    d    D         4  2017
4    e    H         5  2021
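
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the two cleaned frames from the tables above (without the index column) and uses suffixes=("", "_new") instead of the default _x/_y, which is a variation and not the exact code from the answer. It assumes Name is unique in df2; with duplicates, the left merge would produce extra rows and the index alignment would no longer hold.

```python
import pandas as pd

# Sample data copied from the cleaned frames shown above ("index" column omitted).
df1 = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e"],
    "Unit": ["A", "B", "C", "D", "E"],
    "Attribute": [1, 2, 3, 4, 5],
    "Date": [2014, 2015, 2016, 2017, 2018],
})
df2 = pd.DataFrame({
    "Name": ["a", "b", "e", "f"],
    "Unit": ["F", "G", "H", "I"],
    "Date": [2019, 2020, 2021, 2022],
})

# Left merge keeps every row of df1 in its original order; columns that
# exist in both frames get the "_new" suffix for the df2 side.
# Assumption: "Name" is unique in df2, so the row count stays the same.
merged = df1.merge(df2, how="left", on="Name", suffixes=("", "_new"))

result = df1.copy()
for col in ["Unit", "Date"]:
    # Prefer the df2 value; fall back to the df1 value where there was no match.
    updated = merged[col + "_new"].fillna(merged[col])
    # The NaNs from the merge turn Date into float, so restore the original dtype.
    result[col] = updated.astype(df1[col].dtype)

print(result)
```

Rows a, b and e pick up Unit and Date from df2, rows c and d keep their df1 values, and f is ignored because it does not appear in df1.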
