How to merge pandas on string contains?

前端 未结 2 1481
轻奢々
轻奢々 2020-12-10 09:06

I have 2 dataframes that I would like to merge on a common column. However the column I would like to merge on are not of the same string, but rather a string from one is co

相关标签:
2条回答
  • 2020-12-10 09:48

    My solution involves applying a function to the common column. I can't imagine it holds up well when df2 is large but perhaps someone more knowledgeable than I can suggest an improvement.

    def strmerge(strcolumn):
        for i in df2['column_common']:
            if strcolumn in i:
                return df2[df2['column_common'] == i]['column_b'].values[0]
                break
            else:
                pass
    
    df1['column_b'] = df1.apply(lambda x: strmerge(x['column_common']),axis=1)
    
    df1
        column_a    column_common   column_b
    0   John        code            Moore
    1   Michael     other           Cohen
    2   Dan         ome             Smith
    3   George      no match        None
    4   Adam        word            Faber
    
    0 讨论(0)
  • 2020-12-10 10:03

    New Answer

    Here is one approach based on pandas/numpy.

    rhs = (df1.column_common
              .apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
              .bfill(axis=1)
              .iloc[:, 0])
    
    (pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
     .rename(columns={0: 'column_a', 1: 'column_b'}))
    
      column_a column_b
    0     John    Moore
    1  Michael    Cohen
    2      Dan    Smith
    3   George      NaN
    4     Adam    Faber
    

    Old Answer

    Here's a solution for left-join behaviour, as in it doesn't keep column_a values that do not match any column_b values. This is slower than the above numpy/pandas solution because it uses two nested iterrows loops to build a python list.

    tups = [(a1, a2) for i, (a1, b1) in df1.iterrows() 
                     for j, (a2, b2) in df2.iterrows()
            if b1 in b2]
    
    (pd.DataFrame(tups, columns=['column_a', 'column_b'])
       .drop_duplicates('column_a')
       .reset_index(drop=True))
    
      column_a column_b
    0     John    Moore
    1  Michael    Cohen
    2      Dan    Smith
    3     Adam    Faber
    
    0 讨论(0)
提交回复
热议问题