I wonder whether there is a faster way to replace the two for loops, assuming the DataFrames are very large. In my real case, each DataFrame has 200 rows and 25 columns.
Some cleaning:
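def clean_df(df):
    # Promote the first row to the header
    df.columns = df.iloc[0]
    df.columns.name = None
    # Drop the header row; reset_index() keeps the old labels in an "index" column
    df = df.iloc[1:].reset_index()
    return df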
df1 = clean_df(df1)
df1
   index Name Unit Attribute  Date
0      1    a    A         1  2014
1      2    b    B         2  2015
2      3    c    C         3  2016
3      4    d    D         4  2017
4      5    e    E         5  2018
df2 = clean_df(df2)
df2
   index Name Unit  Date
0      1    a    F  2019
1      2    b    G  2020
2      3    e    H  2021
3      4    f    I  2022
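To make the cleaning step concrete, here is a minimal sketch on a hypothetical raw frame whose first row holds the header. The `raw` frame and its values are assumptions for illustration only, and this version copies the input first so the caller's frame is left untouched, which the original function does not do.

```python
import pandas as pd


def clean_df(df):
    # Work on a copy so the caller's frame is not modified (an addition to the original)
    df = df.copy()
    # Promote the first row to the header
    df.columns = df.iloc[0]
    df.columns.name = None
    # Drop the header row; reset_index() keeps the old labels in an "index" column
    return df.iloc[1:].reset_index()


# Hypothetical raw input: the header sits in row 0 instead of in the column labels
raw = pd.DataFrame([
    ["Name", "Unit", "Date"],
    ["a", "F", 2019],
    ["b", "G", 2020],
])

cleaned = clean_df(raw)
print(cleaned)
```

If the extra "index" column is not wanted, reset_index(drop=True) discards it.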
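Use merge with on='Name', so that only Name is used as the join key. The other shared columns (Unit, Date) are not matched on; they come back with _x/_y suffixes instead. Keep the _y versions from df2, strip the suffix, and let fillna(df1) fill the rows that have no match in df2 from the original df1.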
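# Keep Name and Attribute from df1, and Unit/Date from df2 (the "_y" columns)
cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
# Left merge keeps every row of df1; unmatched rows get NaN, which fillna(df1) restores
df1 = df1.merge(df2, how='left', on='Name')[cols]\
    .rename(columns=lambda x: x.split('_')[0]).fillna(df1)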
df1
  Name Unit Attribute  Date
0    a    F         1  2019
1    b    G         2  2020
2    c    C         3  2016
3    d    D         4  2017
4    e    H         5  2021
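For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the two sample frames from the tables above (without the extra "index" column) and applies the same merge-and-fillna step. It assumes Name is unique in df2; with duplicates the left merge would add rows and the positional fillna(df1) would no longer line up. The final astype is an addition, needed here because the sample Date values are built as integers and the NaN from unmatched rows turns the column into floats.

```python
import pandas as pd

# Sample frames rebuilt from the tables above
df1 = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e"],
    "Unit": ["A", "B", "C", "D", "E"],
    "Attribute": [1, 2, 3, 4, 5],
    "Date": [2014, 2015, 2016, 2017, 2018],
})
df2 = pd.DataFrame({
    "Name": ["a", "b", "e", "f"],
    "Unit": ["F", "G", "H", "I"],
    "Date": [2019, 2020, 2021, 2022],
})

# Assumption: one row per Name in df2, so the merge keeps df1's row count and order
assert df2["Name"].is_unique

cols = ["Name", "Unit_y", "Attribute", "Date_y"]
result = (
    df1.merge(df2, how="left", on="Name")[cols]   # Unit/Date taken from df2
    .rename(columns=lambda x: x.split("_")[0])    # "Unit_y" -> "Unit", "Date_y" -> "Date"
    .fillna(df1)                                  # unmatched rows fall back to df1
    .astype({"Date": "int64"})                    # undo the float upcast caused by NaN
)
print(result)
```

Rows c and d keep their df1 values because df2 has no matching Name, and row f from df2 is dropped because a left merge only keeps keys present in df1.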