I am new to pandas and can't seem to get this to work with the merge function:
>>> left >>> right
a b c a c d
0 1 4
DataFrame.update() is nice, but it doesn't let you specify columns to join on and, more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-NaN values in the original DataFrame. To me, this is undesirable behavior.
Here's a custom method I rolled to fix these issues. It's freshly written, so user beware.
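To make the NaN point concrete, here is a minimal sketch. The frames `original` and `other` are made up for illustration and are not from the question. It shows that update() aligns on the index and skips NaN values in the other frame:

```python
import numpy as np
import pandas as pd

# Illustrative frames (hypothetical, not from the question).
original = pd.DataFrame({"key": [1, 2, 3], "val": [10.0, 20.0, 30.0]})
other = pd.DataFrame({"key": [1, 2, 3], "val": [np.nan, 200.0, np.nan]})

updated = original.copy()
# update() aligns on the index (not on "key") and modifies the frame in place.
updated.update(other)

# Only the non-NaN value from `other` was copied over; the NaNs were ignored.
print(updated)
```

Rows 0 and 2 keep their old values even though `other` holds NaN there, which is exactly the behavior I want to avoid.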
import numpy as np
import pandas as pd


def join_insertion(into_df, from_df, on, cols, mult='error'):
    """
    Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}.
    This function allows you to do an operation like:
    "where A and B match via the column foo, insert the values of baz and buz from B into A"
    Note that this'll update A's values for baz and it'll insert buz as a new column.
    This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!

    :param into_df: dataframe you want to modify
    :param from_df: dataframe with the values you want to insert
    :param on: column name, list of column names (values to join on), or a dict of {into:from} column name pairs
    :param cols: column name or list of column names (values to insert)
    :param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
        'error' raises an exception, 'first' inserts the first matching value,
        'last' inserts the last matching value
    :return: a modified copy of into_df, with updated values using from_df
    """
    # Infer left_on, right_on
    if isinstance(on, dict):
        left_on = list(on.keys())
        right_on = list(on.values())
    elif isinstance(on, list):
        left_on = on
        right_on = on
    elif isinstance(on, str):
        left_on = [on]
        right_on = [on]
    else:
        raise Exception("on should be a string, list or dictionary")

    # Make cols a list if it isn't already
    if isinstance(cols, str):
        cols = [cols]

    # Setup: work on copies so neither input is modified
    A = into_df.copy()
    B = from_df[right_on + cols].copy()

    # Insert row ids so duplicated rows can be detected and ordered after the merge
    A['_A_RowId_'] = np.arange(A.shape[0])
    B['_B_RowId_'] = np.arange(B.shape[0])

    A = pd.merge(
        left=A,
        right=B,
        how='left',
        left_on=left_on,
        right_on=right_on,
        suffixes=(None, '_y'),
        indicator=True
    ).sort_values(['_A_RowId_', '_B_RowId_'])

    # Check for rows of A which got duplicated by the merge, and then handle appropriately.
    # drop_duplicates() keeps whole rows; groupby().first()/.last() would skip NaN values
    # column by column and could mix values from different rows of from_df.
    if mult == 'error':
        if A['_A_RowId_'].duplicated().any():
            raise Exception("At least one key of into_df matched multiple rows of from_df.")
    elif mult == 'first':
        A = A.drop_duplicates(subset='_A_RowId_', keep='first').reset_index(drop=True)
    elif mult == 'last':
        A = A.drop_duplicates(subset='_A_RowId_', keep='last').reset_index(drop=True)
    else:
        raise ValueError("mult should be 'error', 'first' or 'last'")

    # For matched rows, overwrite existing columns with the values from from_df (NaN included)
    mask = A['_merge'] == 'both'
    cols_in_both = [col for col in cols if col in into_df.columns]
    for col in cols_in_both:
        A.loc[mask, col] = A.loc[mask, col + '_y']

    # Drop unwanted columns (row ids, merge indicator, '_y' duplicates, extra join keys)
    keep = set(into_df.columns) | set(cols)
    A = A.drop(columns=[col for col in A.columns if col not in keep])

    return A

into_df = pd.DataFrame({
    'foo': [1, 2, 3],
    'bar': [4, 5, 6],
    'baz': [7, 8, 9]
})

   foo  bar  baz
0    1    4    7
1    2    5    8
2    3    6    9

from_df = pd.DataFrame({
    'foo': [1, 3, 5, 7, 3],
    'baz': [70, 80, 90, 30, 40],
    'buz': [0, 1, 2, 3, 4]
})

   foo  baz  buz
0    1   70    0
1    3   80    1
2    5   90    2
3    7   30    3
4    3   40    4

# Use it!
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
Exception: At least one key of into_df matched multiple rows of from_df.

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  80.0  1.0

join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
   foo  bar   baz  buz
0    1    4  70.0  0.0
1    2    5   8.0  NaN
2    3    6  40.0  4.0

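If you only need the core idea and your join keys are unique, it can be sketched in a few lines. The names `target` and `source` and their values are made up for illustration, and the sketch assumes each key appears at most once in `source`. It shows a left merge with indicator=True, where matched rows take the value from the other frame even when that value is NaN:

```python
import numpy as np
import pandas as pd

# Illustrative frames (hypothetical); keys in `source` are assumed unique.
target = pd.DataFrame({"foo": [1, 2, 3], "baz": [7.0, 8.0, 9.0]})
source = pd.DataFrame({"foo": [1, 3], "baz": [np.nan, 80.0], "buz": [0, 1]})

# The left merge keeps every row of `target`; "_merge" records which rows found a match.
merged = target.merge(source, how="left", on="foo",
                      suffixes=("", "_y"), indicator=True)

# Matched rows take the value from `source`, NaN included; unmatched rows are left alone.
matched = merged["_merge"] == "both"
merged.loc[matched, "baz"] = merged.loc[matched, "baz_y"]

# Drop the helper columns; "buz" stays as a newly inserted column.
result = merged.drop(columns=["baz_y", "_merge"])
print(result)
```

Here the row with foo == 1 ends up with NaN in baz, which update() would not have written, and the unmatched row with foo == 2 keeps its 8.0.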
As an aside, this is one of those things I severely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]