Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

渐次进展 2020-12-08 23:36

I'm trying to fuzzy match two CSV files, each containing one column of names, that are similar but not the same.

My code so far is as follows:

impor         


        
3 Answers
  •  爱一瞬间的悲伤
    2020-12-09 00:21

    I just wrote the same thing for myself, but in pandas:

    import pandas as pd
    from fuzzywuzzy import fuzz
    
    d1 = {1: 'Tim', 2: 'Ted', 3: 'Sally', 4: 'Dick', 5: 'Ethel'}
    d2 = {1: 'Tam', 2: 'Tid', 3: 'Sally', 4: 'Dicky', 5: 'Aardvark'}
    
    df1 = pd.DataFrame.from_dict(d1, orient='index')
    df2 = pd.DataFrame.from_dict(d2, orient='index')
    
    df1.columns = ['Name']
    df2.columns = ['Name']
    
    def match(col1, col2, threshold=50):
        # For each name in col1, keep the best-scoring name in col2,
        # or " " when no candidate scores above the threshold.
        overall = []
        for n in col1:
            # Score each candidate once, then filter on the threshold.
            scored = [(fuzz.partial_ratio(n, n2), n2) for n2 in col2]
            result = [pair for pair in scored if pair[0] > threshold]
            if result:
                # Tuples sort by score, then by name, so the best match is last.
                result.sort()
                print('result {}'.format(result))
                print('Best M={}'.format(result[-1][1]))
                overall.append(result[-1][1])
            else:
                overall.append(' ')
        return overall
    
    print(match(df1.Name, df2.Name))
    

    I used a threshold of 50 here, but you can change it. Names with no candidate scoring above the threshold get a blank (" ") instead of a match.

    The first DataFrame (df1) looks like:

        Name
    1   Tim
    2   Ted
    3   Sally
    4   Dick
    5   Ethel
    

    And the second DataFrame (df2) looks like:

        Name
    1   Tam
    2   Tid
    3   Sally
    4   Dicky
    5   Aardvark
    

    Running it ends by printing the list of best matches, one per name in the first DataFrame:

    ['Tid', 'Tid', 'Sally', 'Dicky', ' ']
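
If FuzzyWuzzy is not installed, the same keep-only-the-best-match idea can be sketched with the standard library. This is only a rough sketch, not the code from the answer: it swaps fuzz.partial_ratio for difflib.SequenceMatcher.ratio, so the scores will differ, and the helper names similarity and best_matches are hypothetical. Ties are broken by name, as with the sort() in the answer.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 score (a stand-in for a FuzzyWuzzy scorer, not the same metric)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches(names, candidates, threshold=50):
    """Return one (name, best_candidate, score) row per name.

    best_candidate is None when the top score does not exceed the threshold.
    """
    rows = []
    for name in names:
        # max() over (score, candidate) tuples compares the score first and
        # the candidate string second, so ties go to the later name alphabetically.
        score, candidate = max((similarity(name, c), c) for c in candidates)
        if score > threshold:
            rows.append((name, candidate, score))
        else:
            rows.append((name, None, score))
    return rows


names = ['Tim', 'Ted', 'Sally', 'Dick', 'Ethel']
candidates = ['Tam', 'Tid', 'Sally', 'Dicky', 'Aardvark']

for row in best_matches(names, candidates):
    print(row)
```

Returning the score next to each match makes it easier to tune the threshold, and using None rather than " " for unmatched names keeps them easy to filter out later. With the sample names, this sketch picks Tid, Tid, Sally and Dicky and leaves Ethel unmatched.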
    

    Hope this helps.
