Left join in pandas without the creation of left and right variables

问题

I'm missing something in the syntax of merging in pandas.

I have the following 2 data frames:

>>> dfA
  s_name  geo    zip  date value
0  A002X  zip  60601  2010  None
1  A002Y  zip  60601  2010  None
2  A003X  zip  60601  2010  None
3  A003Y  zip  60601  2010  None

(or potentially some values exist which will not overlap with dfB:

>>> dfA_alternate
  s_name  geo    zip  date value
0  A002X  zip  60601  2010   NaN
1  A002Y  zip  60601  2010   2.0
2  A003X  zip  60601  2010   NaN
3  A003Y  zip  60601  2010   NaN

)

And

>>> dfB
  s_name  geo    zip  date  value
0  A002X  zip  60601  2010    1.0
1  A002Y  zip  60601  2010    NaN
3  A003Y  zip  60601  2010    4.0

I'd like to join the data present in dfB onto dfA, like so:

>>> new
  s_name  geo    zip  date value
0  A002X  zip  60601  2010   1.0
1  A002Y  zip  60601  2010   NaN
2  A003X  zip  60601  2010   NaN
3  A003Y  zip  60601  2010   4.0

(or

>>> new_alternate
  s_name  geo    zip  date value
0  A002X  zip  60601  2010   1.0
1  A002Y  zip  60601  2010   2.0
2  A003X  zip  60601  2010   NaN
3  A003Y  zip  60601  2010   4.0

)

However, what seems like natural syntax actually makes extra columns:

>>> pd.merge(dfA,dfB,on=["s_name","geo","zip","date"],how="left")
  s_name  geo    zip  date value_x  value_y
0  A002X  zip  60601  2010    None      1.0
1  A002Y  zip  60601  2010    None      NaN
2  A003X  zip  60601  2010    None      NaN
3  A003Y  zip  60601  2010    None      4.0

(

>>> # alternate
>>> pd.merge(dfA_alterate,dfB,on=["s_name","geo","zip","date"],how="left")
  s_name  geo    zip  date value_x  value_y
0  A002X  zip  60601  2010     NaN      1.0
1  A002Y  zip  60601  2010     2.0      NaN
2  A003X  zip  60601  2010     NaN      NaN
3  A003Y  zip  60601  2010     NaN      4.0

)

There's value_x and value_y when I'd rather just have value.

I get that I can clean this up after the fact with:

new["value"] = new.apply(lambda r: r.value_x or r.value_y, axis=1)
new.drop(["value_x", "value_y"], axis=1, inplace=True)

But I imagine there's just merge syntax I need to change to get it right without post-processing. What am I missing?

回答1:

I think you need combine_first with MultiIndex created by set_index:

cols = ["s_name","geo","zip","date"]

df = dfA.set_index(cols).combine_first(dfB.set_index(cols)).reset_index()
print (df)
  s_name  geo    zip  date  value
0  A002X  zip  60601  2010    1.0
1  A002Y  zip  60601  2010    2.0
2  A003X  zip  60601  2010    NaN
3  A003Y  zip  60601  2010    4.0

Or update:

df = dfA.set_index(cols)
df.update(dfB.set_index(cols))
df = df.reset_index()

来源：https://stackoverflow.com/questions/54018031/left-join-in-pandas-without-the-creation-of-left-and-right-variables

标签

python

pandas

pandas-groupby