Is it possible to do a fuzzy match merge with Python pandas?

[愿得一人] 2020-11-22 01:17

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I woul

11 answers
  •  傲寒 (original poster)
    2020-11-22 02:02

    Using fuzzywuzzy

    2019 answer

    Since there are no examples with the fuzzywuzzy package, here's a function I wrote which will return all matches based on a threshold you can set as a user:


    Example dataframes

    import pandas as pd

    df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
    df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
    
    # df1
              Key
    0       Apple
    1      Banana
    2      Orange
    3  Strawberry
    
    # df2
            Key
    0      Aple
    1     Mango
    2      Orag
    3     Straw
    4  Bannanna
    5     Berry
    

    Function for fuzzy matching

    def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
        """
        :param df_1: the left table to join
        :param df_2: the right table to join
        :param key1: key column of the left table
        :param key2: key column of the right table
        :param threshold: minimum similarity score (0-100) a candidate needs to count as a match
        :param limit: the maximum number of candidates considered per row, sorted from best to worst
        :return: copy of df_1 with an added 'matches' column (comma-separated matches)
        """
        # Candidate strings from the right table
        s = df_2[key2].tolist()

        # For each left key, get up to `limit` (candidate, score) pairs, best first
        m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))

        # Keep only the candidates that reach the threshold and join them into one string
        m2 = m.apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))

        # Work on a copy so the caller's dataframe is not modified
        df_1 = df_1.copy()
        df_1['matches'] = m2
        return df_1
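
    The threshold is a similarity score from 0 to 100 (higher means closer), not an edit distance. To get a feel for how such a cutoff behaves, here is a rough sketch that uses the standard library's difflib instead of fuzzywuzzy. The helpers similarity and matches_above are made up for this illustration, and the scores will not be identical to what process.extract returns:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (illustrative helper, not fuzzywuzzy's scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def matches_above(name, candidates, threshold=80):
    """Keep the candidates whose score against `name` meets the threshold, best first."""
    scored = [(c, similarity(name, c)) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


candidates = ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']
print(matches_above('Apple', candidates))                 # default threshold of 80
print(matches_above('Orange', candidates, threshold=90))  # stricter cutoff
```

    Raising the threshold drops the weaker candidates, which is the same trade-off you tune with the threshold argument of fuzzy_merge.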
    

    Using our function on the dataframes: #1

    from fuzzywuzzy import process
    
    fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
    
              Key       matches
    0       Apple          Aple
    1      Banana      Bannanna
    2      Orange          Orag
    3  Strawberry  Straw, Berry
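
    The function only adds a column of matched strings. If you want the actual columns of df2 joined in, one possible follow-up is to split the matches into one row per match and then do an ordinary exact merge. This is only a sketch: the matched frame below is hard-coded to the output shown above so it runs without fuzzywuzzy, and the Price column is invented for the illustration:

```python
import pandas as pd

# Output of fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80) as shown above,
# hard-coded here so the sketch does not depend on fuzzywuzzy.
matched = pd.DataFrame({
    'Key': ['Apple', 'Banana', 'Orange', 'Strawberry'],
    'matches': ['Aple', 'Bannanna', 'Orag', 'Straw, Berry'],
})

# Right table with an extra column to pull in (the Price values are made up).
df2 = pd.DataFrame({
    'Key': ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry'],
    'Price': [1, 2, 3, 4, 5, 6],
})

# One row per individual match: split the comma-joined string, then explode.
long = matched.assign(match=matched['matches'].str.split(', ')).explode('match')

# Ordinary exact merge on the matched spelling from the right table.
merged = long.merge(df2, left_on='match', right_on='Key',
                    how='left', suffixes=('', '_right'))

print(merged[['Key', 'match', 'Price']])
```

    Rows without any match (an empty string in matches) stay in the result with NaN in the df2 columns because of how='left'.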
    

    Using our function on the dataframes: #2

    df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
    df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})
    
    fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)
    
            Col1  matches
    0  Microsoft  Mcrsoft
    1     Google    gogle
    2     Amazon   Amason
    3        IBM         
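
    The question also mentions differing numbers of spaces and diacritical marks. Those can be removed before matching so the fuzzy scorer only has to deal with real spelling differences. A minimal sketch using only the standard library (normalize_key is a hypothetical helper name, and lowercasing is an extra assumption that may not suit every dataset):

```python
import unicodedata


def normalize_key(text):
    """Strip diacritics, collapse whitespace, and lowercase (illustrative helper)."""
    # NFKD splits accented characters into a base character plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    no_marks = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # split() without arguments drops leading/trailing and repeated whitespace
    return ' '.join(no_marks.split()).lower()


print(normalize_key('  Société   Générale '))
```

    You could store the result in a helper column, for example df1['Key'].map(normalize_key), and pass that column to fuzzy_merge instead of the raw key.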
    

    Installation:

    Pip

    pip install fuzzywuzzy
    

    Anaconda

    conda install -c conda-forge fuzzywuzzy
    
