Pandas fuzzy merge/match name column, with duplicates

离开以前 2020-12-16 05:44

I have two dataframes currently, one for donors and one for fundraisers. I'm trying to find if any fundraisers also gave donations, a

3 answers
  • 2020-12-16 06:21

    Here's a somewhat more Pythonic (in my view) version without explicit loops. It works on your example:

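    import pandas as pd
    from fuzzywuzzy import fuzz  # assumed source of `fuzz`; the answer does not show its imports

    def get_donors(row):
        # Score each donor: twice the name ratio when the emails match, otherwise 1.
        d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1)
        # Keep only candidates scoring at least 75.
        d = d[d >= 75]
        if len(d) == 0:
            v = ['']*3
        else:
            # .loc replaces .ix, which was removed in pandas 1.0.
            v = donors.loc[d.idxmax(), ['name','Email','Date']].values
        return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

    pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)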
    

    Output:

                     Date           Email        name donor name     donor email           donor date
    0 2013-03-27 10:00:00          a@a.ca    John Doe   John Doe          a@a.ca  2013-03-01 10:39:00
    1 2013-03-01 10:39:00          a@a.ca    John Doe   John Doe          a@a.ca  2013-03-01 10:39:00
    2 2013-03-02 10:39:00          d@d.ca  Kathy test   Kat test          d@d.ca  2013-03-27 10:39:00
    3 2013-03-03 10:39:00    asdf@asdf.ca   Tes Ester                                                
    4 2013-03-04 10:39:00  something@a.ca    Jane Doe   Jane Doe  something@a.ca  2013-03-04 10:39:00
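
The scoring rule inside get_donors is easy to misread, so here is a minimal standard-library sketch of it. It assumes difflib.SequenceMatcher as a stand-in for fuzz.ratio (the scores will not be identical to fuzzywuzzy's), and the helper names name_ratio, match_score and best_donor are made up for illustration. The sample rows are taken from the output above.

```python
from difflib import SequenceMatcher

def name_ratio(a, b):
    """0-100 similarity; a stand-in for fuzz.ratio, not the same implementation."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def match_score(fundraiser, donor):
    # Same rule as the lambda above: double the name ratio when the emails
    # are equal, otherwise a constant 1 (which can never reach the threshold).
    if fundraiser["Email"] == donor["Email"]:
        return 2 * name_ratio(fundraiser["name"], donor["name"])
    return 1

def best_donor(fundraiser, donors, threshold=75):
    # Pick the highest-scoring donor, or None if nobody reaches the threshold.
    scored = [(match_score(fundraiser, d), d) for d in donors]
    score, donor = max(scored, key=lambda pair: pair[0])
    return donor if score >= threshold else None

donors = [
    {"name": "John Doe", "Email": "a@a.ca"},
    {"name": "Kat test", "Email": "d@d.ca"},
    {"name": "Jane Doe", "Email": "something@a.ca"},
]

print(best_donor({"name": "Kathy test", "Email": "d@d.ca"}, donors))
print(best_donor({"name": "Tes Ester", "Email": "asdf@asdf.ca"}, donors))
```

Note that with this rule a donor whose email differs can never match, however similar the names are. The name ratio only decides between donors that share the fundraiser's email.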
    
  • 2020-12-16 06:39

    I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].

    This is how I would do it with Jaro-Winkler from the jellyfish package:

    import jellyfish
    import pandas

    def get_closest_match(x, list_strings):
        """Return the string in list_strings with the highest Jaro-Winkler score against x."""
        best_match = None
        highest_jw = 0

        for current_string in list_strings:
            # jellyfish 0.8+ names this jaro_winkler_similarity; older releases call it jaro_winkler.
            current_score = jellyfish.jaro_winkler_similarity(x, current_string)

            if current_score > highest_jw:
                highest_jw = current_score
                best_match = current_string

        return best_match

    df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
    df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

    # Replace each df2 index label with its closest match in df1's index.
    df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

    df1.join(df2)
    

    Output:

           number letter
    one         1      a
    two         2      b
    three       3      c
    four        4      d
    five        5      e
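
For anyone curious what the score actually measures, below is a rough pure-Python sketch of Jaro-Winkler. It is not jellyfish's code: it assumes the common textbook definition with a prefix weight of 0.1 over at most 4 characters, and it applies the prefix bonus unconditionally, whereas some libraries only apply it above a cutoff. Exact scores may therefore differ from a library's. The function names jaro, jaro_winkler and closest are illustrative only.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters of both strings in order and count
    # positions where they disagree; two disagreements make one transposition.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro score boosted by the length of the common prefix (max 4 chars)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)

def closest(x, candidates):
    # Same idea as get_closest_match above, using the sketch scorer.
    return max(candidates, key=lambda c: jaro_winkler(x, c))

targets = ['one', 'two', 'three', 'four', 'five']
print([closest(x, targets) for x in ['one', 'too', 'three', 'fours', 'five']])
# → ['one', 'two', 'three', 'four', 'five']
```

Keep in mind that a closest-match function always returns something, even for a label that resembles none of the candidates, so a minimum score check may be worth adding before joining on the result.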
    

    Update: Use jaro_winkler from the Levenshtein module for improved performance.

    from jellyfish import jaro_winkler as jf_jw
    from Levenshtein import jaro_winkler as lv_jw
    
    %timeit jf_jw("appel", "apple")
    >> 339 ns ± 1.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
    %timeit lv_jw("appel", "apple")
    >> 193 ns ± 0.675 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
  • 2020-12-16 06:42

    How to identify Fuzzy duplication in DataFrame using Pandas

    This is how I did it for my data frame, which has a Name_1 column:

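    from fuzzywuzzy import fuzz  # assumed import; not shown in the original

    def get_ratio(row):
        name = row['Name_1']
        return fuzz.token_sort_ratio(name, "Ceylon Hotels Corporation")

    # Keep rows whose Name_1 scores above 70 against the reference name.
    df[df.apply(get_ratio, axis=1) > 70]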
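
To show why token_sort_ratio helps with reordered names, here is a rough standard-library sketch of the idea: lowercase, split into tokens, sort the tokens, then compare the rejoined strings. It assumes difflib.SequenceMatcher for the comparison and a simple ASCII-only cleanup regex, so treat it as an approximation of the library function, not its actual code. The name token_sort_ratio_sketch is made up.

```python
import re
from difflib import SequenceMatcher

def token_sort_ratio_sketch(a, b):
    """Approximate order-insensitive similarity on a 0-100 scale."""
    def normalize(s):
        # Lowercase, turn anything that is not a letter or digit into a
        # space, then sort the tokens so word order no longer matters.
        tokens = re.sub(r"[^0-9a-z]+", " ", s.lower()).split()
        return " ".join(sorted(tokens))
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())

reference = "Ceylon Hotels Corporation"
print(token_sort_ratio_sketch("Corporation Ceylon Hotels", reference))  # → 100
print(token_sort_ratio_sketch("Ceylon Hotel Corp", reference))
```

Because the tokens are sorted first, "Corporation Ceylon Hotels" and "Ceylon Hotels Corporation" compare as identical, which a plain character-level ratio would not give you.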
    