Question
I'm missing something in the syntax of merging in pandas.
I have the following 2 data frames:
>>> dfA
s_name geo zip date value
0 A002X zip 60601 2010 None
1 A002Y zip 60601 2010 None
2 A003X zip 60601 2010 None
3 A003Y zip 60601 2010 None
(or potentially some values exist which will not overlap with dfB:
>>> dfA_alternate
s_name geo zip date value
0 A002X zip 60601 2010 NaN
1 A002Y zip 60601 2010 2.0
2 A003X zip 60601 2010 NaN
3 A003Y zip 60601 2010 NaN
)
And
>>> dfB
s_name geo zip date value
0 A002X zip 60601 2010 1.0
1 A002Y zip 60601 2010 NaN
3 A003Y zip 60601 2010 4.0
I'd like to join the data present in dfB onto dfA, like so:
>>> new
s_name geo zip date value
0 A002X zip 60601 2010 1.0
1 A002Y zip 60601 2010 NaN
2 A003X zip 60601 2010 NaN
3 A003Y zip 60601 2010 4.0
(or
>>> new_alternate
s_name geo zip date value
0 A002X zip 60601 2010 1.0
1 A002Y zip 60601 2010 2.0
2 A003X zip 60601 2010 NaN
3 A003Y zip 60601 2010 4.0
)
However, what seems like natural syntax actually makes extra columns:
>>> pd.merge(dfA,dfB,on=["s_name","geo","zip","date"],how="left")
s_name geo zip date value_x value_y
0 A002X zip 60601 2010 None 1.0
1 A002Y zip 60601 2010 None NaN
2 A003X zip 60601 2010 None NaN
3 A003Y zip 60601 2010 None 4.0
(
>>> # alternate
>>> pd.merge(dfA_alternate,dfB,on=["s_name","geo","zip","date"],how="left")
s_name geo zip date value_x value_y
0 A002X zip 60601 2010 NaN 1.0
1 A002Y zip 60601 2010 2.0 NaN
2 A003X zip 60601 2010 NaN NaN
3 A003Y zip 60601 2010 NaN 4.0
)
There's value_x and value_y when I'd rather just have value.
I get that I can clean this up after the fact with:
# take value_x where present, else value_y (NaN is truthy, so `or` would not fall through)
new["value"] = new["value_x"].combine_first(new["value_y"])
new.drop(["value_x", "value_y"], axis=1, inplace=True)
But I imagine there's just merge syntax I need to change to get it right without post-processing. What am I missing?
Answer 1:
I think you need combine_first with a MultiIndex created by set_index:
cols = ["s_name","geo","zip","date"]
df = dfA.set_index(cols).combine_first(dfB.set_index(cols)).reset_index()
print (df)
s_name geo zip date value
0 A002X zip 60601 2010 1.0
1 A002Y zip 60601 2010 2.0
2 A003X zip 60601 2010 NaN
3 A003Y zip 60601 2010 4.0
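The printed result above matches the dfA_alternate input from the question (note the 2.0 for A002Y). A self-contained sketch of that case follows; the frames are rebuilt by hand from the question's printouts, and the name `combined` is only illustrative.

```python
import numpy as np
import pandas as pd

cols = ["s_name", "geo", "zip", "date"]

# dfA_alternate and dfB, rebuilt from the tables printed in the question.
dfA_alternate = pd.DataFrame({
    "s_name": ["A002X", "A002Y", "A003X", "A003Y"],
    "geo": ["zip"] * 4,
    "zip": [60601] * 4,
    "date": [2010] * 4,
    "value": [np.nan, 2.0, np.nan, np.nan],
})
dfB = pd.DataFrame({
    "s_name": ["A002X", "A002Y", "A003Y"],
    "geo": ["zip"] * 3,
    "zip": [60601] * 3,
    "date": [2010] * 3,
    "value": [1.0, np.nan, 4.0],
}, index=[0, 1, 3])

# Index both frames by the key columns, keep dfA_alternate's values where
# they exist, and fill its NaN cells from dfB.
combined = (
    dfA_alternate.set_index(cols)
    .combine_first(dfB.set_index(cols))
    .reset_index()
)
print(combined)
```

One thing to keep in mind: combine_first returns the union of both indexes, so a key that exists only in dfB would also show up in the result. That does not happen with this data, but it is not strictly a left join.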
Or use update, which modifies the frame in place:
df = dfA.set_index(cols)
df.update(dfB.set_index(cols))
df = df.reset_index()
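The two approaches give the same result on the question's data, but they can differ when both frames hold a non-null value for the same key, or when the right frame has extra keys. The sketch below illustrates this with hypothetical data: the 9.0 for A002Y and the A004X row are not in the question, and `make`, `left`, `right`, `filled` and `updated` are illustrative names.

```python
import numpy as np
import pandas as pd

cols = ["s_name", "geo", "zip", "date"]

def make(names, values):
    # Hypothetical helper: builds a frame shaped like the ones in the question.
    return pd.DataFrame({
        "s_name": names,
        "geo": "zip",
        "zip": 60601,
        "date": 2010,
        "value": values,
    })

left = make(["A002X", "A002Y", "A003X", "A003Y"], [np.nan, 2.0, np.nan, np.nan])
# Assumed data: 9.0 conflicts with left's 2.0, and A004X exists only on the right.
right = make(["A002X", "A002Y", "A003Y", "A004X"], [1.0, 9.0, 4.0, 7.0])

# combine_first: left values win, NaN cells are filled from right,
# and keys found only in right are added.
filled = left.set_index(cols).combine_first(right.set_index(cols)).reset_index()

# update: non-null right values overwrite left in place,
# and keys found only in right are ignored.
updated = left.set_index(cols)
updated.update(right.set_index(cols))
updated = updated.reset_index()

print(filled)
print(updated)
```

So combine_first is probably the better fit when dfA's existing values should be preserved, and update when dfB should take precedence while keeping exactly dfA's rows.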
Source: https://stackoverflow.com/questions/54018031/left-join-in-pandas-without-the-creation-of-left-and-right-variables