Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

渐次进展 2020-12-08 23:36

I'm trying to fuzzy match two CSV files, each containing one column of names that are similar but not identical.

My code so far is as follows:

impor         


        
3 Answers
  • 2020-12-09 00:21

    I just wrote the same thing for myself, but in pandas:

    import pandas as pd
    from fuzzywuzzy import fuzz

    # Use integer keys throughout so both indexes are consistent
    d1 = {1: 'Tim', 2: 'Ted', 3: 'Sally', 4: 'Dick', 5: 'Ethel'}
    d2 = {1: 'Tam', 2: 'Tid', 3: 'Sally', 4: 'Dicky', 5: 'Aardvark'}

    df1 = pd.DataFrame.from_dict(d1, orient='index')
    df2 = pd.DataFrame.from_dict(d2, orient='index')

    df1.columns = ['Name']
    df2.columns = ['Name']

    def match(col1, col2, threshold=50):
        """For each name in col1, return the best match in col2, or " " if none."""
        overall = []
        for n in col1:
            # Score each candidate once and keep those above the threshold
            scores = [(fuzz.partial_ratio(n, n2), n2) for n2 in col2]
            result = [pair for pair in scores if pair[0] > threshold]
            if result:
                result.sort()  # ascending by (score, name), so the best match is last
                print('result {}'.format(result))
                print("Best M={}".format(result[-1][1]))
                overall.append(result[-1][1])
            else:
                overall.append(" ")
        return overall

    print(match(df1.Name, df2.Name))
    

    I used a threshold of 50 here, but it is configurable.

    Dataframe1 looks like

        Name
    1   Tim
    2   Ted
    3   Sally
    4   Dick
    5   Ethel
    

    And Dataframe2 looks like

        Name
    1   Tam
    2   Tid
    3   Sally
    4   Dicky
    5   Aardvark
    

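    Running it prints the intermediate results for each name and then the final list of matches. Note that 'Tim' scores equally against 'Tam' and 'Tid'; because the (score, name) tuples are sorted, the tie goes to the alphabetically later name, 'Tid':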

    ['Tid', 'Tid', 'Sally', 'Dicky', ' ']
    

    Hope this helps.

  • 2020-12-09 00:24

    Several pieces of your code can be greatly simplified by using process.extractOne() from FuzzyWuzzy. It returns only the top match, and you can set a score threshold within the function call itself rather than performing a separate logical step, e.g.:

    process.extractOne(row, data, score_cutoff=60)
    

    This function will return a tuple of the highest match plus the accompanying score if it finds a match satisfying the condition. It will return None otherwise.
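To make the "tuple or None" contract concrete, here is a minimal sketch that mimics it with the standard library only. The helper name `best_match` and the sample names are hypothetical, and the score comes from `difflib.SequenceMatcher`, which is a stand-in and not FuzzyWuzzy's scorer, so the numbers will differ from what `process.extractOne()` reports.

```python
from difflib import SequenceMatcher


def best_match(query, choices, score_cutoff=0):
    """Return (choice, score) for the highest-scoring choice, or None.

    Hypothetical helper: the score is difflib's similarity ratio scaled
    to 0-100, which is NOT the scorer FuzzyWuzzy uses.
    """
    best = None
    for choice in choices:
        score = round(100 * SequenceMatcher(None, query, choice).ratio())
        # Keep a candidate only if it meets the cutoff and beats the current best
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best


names = ["Tam", "Tid", "Sally", "Dicky", "Aardvark"]

for query in ["Dick", "Ethel"]:
    hit = best_match(query, names, score_cutoff=60)
    if hit is None:
        # Nothing reached the cutoff, so there is no tuple to unpack
        print(f"{query}: no match above the cutoff")
    else:
        found, score = hit
        print(f"{query}: {found} ({score})")
```

The point of the sketch is the calling pattern: check for None before unpacking, because a query with no candidate at or above the cutoff yields no tuple at all.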

  • 2020-12-09 00:32

    fuzzywuzzy's process.extract() returns its matches sorted by score in descending order, so the best match comes first.

    To get just the best match, set the limit argument to 1 so that only the top match is returned. If its score is at least 60, write it to the CSV as you are doing now.

    Example -

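    from fuzzywuzzy import process

    # parse_csv, data and writer are defined in the question's code.
    # Unpacking three values per match assumes data is a dict, in which case
    # each match is (value, score, key).
    for row in parse_csv("names_2.csv"):
        # limit=1 returns a list holding only the best match
        for found, score, matchrow in process.extract(row, data, limit=1):
            if score >= 60:
                print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
                Digi_Results = [row, score, found]
                writer.writerow(Digi_Results)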
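Since the question's own code is cut off, the following is a rough end-to-end sketch of the same "one best match per row, written to CSV" flow using only the standard library. It swaps in `difflib.get_close_matches` (with `n=1` and `cutoff=0.6`) for `process.extract(..., limit=1)`, so the scoring is not FuzzyWuzzy's. The in-memory sample data and variable names are assumptions for illustration, not the asker's files.

```python
import csv
import io
from difflib import get_close_matches

# In-memory stand-ins for the two CSV files (hypothetical sample data)
lookup_csv = io.StringIO("Tim\nTed\nSally\nDick\nEthel\n")
master_csv = io.StringIO("Tam\nTid\nSally\nDicky\nAardvark\n")

# Names to match against, one per row
master_names = [row[0] for row in csv.reader(master_csv) if row]

out = io.StringIO()
writer = csv.writer(out)

for row in csv.reader(lookup_csv):
    if not row:
        continue  # skip blank lines
    name = row[0]
    # n=1 keeps only the single best candidate; cutoff=0.6 drops weak ones
    best = get_close_matches(name, master_names, n=1, cutoff=0.6)
    if best:
        writer.writerow([name, best[0]])

print(out.getvalue())
```

Rows with no candidate above the cutoff are simply not written, which mirrors the `if score >= 60` guard in the answer's loop.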
    