Is it possible to do a fuzzy match merge with Python pandas?

[愿得一人] 2020-11-22 01:17

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I woul

11 answers
  •  南方客 (original poster)
    2020-11-22 01:42

    I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
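
    For readers who have not met the metric, the standard textbook definition is sketched below. The symbols are my own labels, not from the answer: m is the number of matching characters (equal characters no farther apart than floor(max(|s_1|, |s_2|)/2) - 1 positions), t is half the number of matched characters that appear in a different order, \ell is the length of the common prefix (capped at 4), and p is the prefix weight, commonly 0.1. Individual libraries may differ in small details.

```latex
\[
\mathrm{sim}_j(s_1, s_2) =
\begin{cases}
0 & \text{if } m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\]
\[
\mathrm{sim}_w(s_1, s_2) = \mathrm{sim}_j + \ell\, p\, (1 - \mathrm{sim}_j).
\]
% Worked example for "too" vs "two": m = 2 (t and the final o), t = 0, \ell = 1, p = 0.1
\[
\mathrm{sim}_j = \tfrac{1}{3}\left(\tfrac{2}{3} + \tfrac{2}{3} + \tfrac{2}{2}\right) = \tfrac{7}{9},
\qquad
\mathrm{sim}_w = \tfrac{7}{9} + 1 \cdot 0.1 \cdot \tfrac{2}{9} = 0.8 .
\]
```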

    This is how I would do it with Jaro-Winkler from the jellyfish package:

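    import jellyfish
    import pandas

    def get_closest_match(x, list_strings):
        """Return the string in list_strings with the highest Jaro-Winkler similarity to x."""
        best_match = None
        highest_jw = 0

        for current_string in list_strings:
            # Newer jellyfish releases name this jaro_winkler_similarity;
            # older releases exposed it as jaro_winkler.
            current_score = jellyfish.jaro_winkler_similarity(x, current_string)

            if current_score > highest_jw:
                highest_jw = current_score
                best_match = current_string

        return best_match

    df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
    df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

    # Replace each df2 label with its closest df1 label so the two indexes line up
    df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

    df1.join(df2)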
    

    Output:

           number letter
    one         1      a
    two         2      b
    three       3      c
    four        4      d
    five        5      e
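
One caveat with picking the single highest score: every label gets mapped to something, even when nothing is really close. A minimal sketch of a cutoff-based variant follows. To keep it dependency-free it swaps in the standard library's `difflib.get_close_matches` (a different similarity measure, not Jaro-Winkler); the helper name `closest_match_or_none` and the 0.6 cutoff are assumptions for illustration, not part of the answer above.

```python
from difflib import get_close_matches

def closest_match_or_none(x, candidates, cutoff=0.6):
    """Return the candidate most similar to x, or None if none reaches the cutoff.

    Hypothetical helper: uses difflib's ratio-based similarity rather than
    Jaro-Winkler, so scores are not comparable with jellyfish's.
    """
    matches = get_close_matches(x, list(candidates), n=1, cutoff=cutoff)
    return matches[0] if matches else None

keys = ['one', 'two', 'three', 'four', 'five']

print(closest_match_or_none('fours', keys))   # 'four'
print(closest_match_or_none('eleven', keys))  # None: nothing is similar enough
```

With the DataFrames above, the same helper could be passed to `df2.index.map(...)`; labels that map to None can then be inspected or dropped before the join instead of being silently attached to the wrong row.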
    
