How to merge pandas on string contains?

前端未结

关注

 2  1483

I have 2 dataframes that I would like to merge on a common column. However the column I would like to merge on are not of the same string, but rather a string from one is co

相关标签:

2条回答

一整个雨季

2020-12-10 09:48

My solution involves applying a function to the common column. I can't imagine it holds up well when df2 is large but perhaps someone more knowledgeable than I can suggest an improvement.

def strmerge(strcolumn):
    for i in df2['column_common']:
        if strcolumn in i:
            return df2[df2['column_common'] == i]['column_b'].values[0]
            break
        else:
            pass

df1['column_b'] = df1.apply(lambda x: strmerge(x['column_common']),axis=1)

df1
    column_a    column_common   column_b
0   John        code            Moore
1   Michael     other           Cohen
2   Dan         ome             Smith
3   George      no match        None
4   Adam        word            Faber

0 讨论(0)

暗喜

2020-12-10 10:03

New Answer

Here is one approach based on pandas/numpy.

rhs = (df1.column_common
          .apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
          .bfill(axis=1)
          .iloc[:, 0])

(pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
 .rename(columns={0: 'column_a', 1: 'column_b'}))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3   George      NaN
4     Adam    Faber

Old Answer

Here's a solution for left-join behaviour, as in it doesn't keep column_a values that do not match any column_b values. This is slower than the above numpy/pandas solution because it uses two nested iterrows loops to build a python list.

tups = [(a1, a2) for i, (a1, b1) in df1.iterrows() 
                 for j, (a2, b2) in df2.iterrows()
        if b1 in b2]

(pd.DataFrame(tups, columns=['column_a', 'column_b'])
   .drop_duplicates('column_a')
   .reset_index(drop=True))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3     Adam    Faber

0 讨论(0)