Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

后端未结

关注

 3  1661

I\'m trying to fuzzy match two csv files, each containing one column of names, that are similar but not the same.

My code so far is as follows:

impor


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  爱一瞬间的悲伤        
                
              
                            
                2020-12-09 00:21
              
            
            
                                                                       
I just wrote the same thing for myself but in pandas....

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

d1={1:'Tim','2':'Ted',3:'Sally',4:'Dick',5:'Ethel'}
d2={1:'Tam','2':'Tid',3:'Sally',4:'Dicky',5:'Aardvark'}

df1=pd.DataFrame.from_dict(d1,orient='index')
df2=pd.DataFrame.from_dict(d2,orient='index')

df1.columns=['Name']
df2.columns=['Name']

def match(Col1,Col2):
    overall=[]
    for n in Col1:
        result=[(fuzz.partial_ratio(n, n2),n2) 
                for n2 in Col2 if fuzz.partial_ratio(n, n2)>50
               ]
        if len(result):
            result.sort()    
            print('result {}'.format(result))
            print("Best M={}".format(result[-1][1]))
            overall.append(result[-1][1])
        else:
            overall.append(" ")
    return overall

print(match(df1.Name,df2.Name))


I have used a threshold of 50 in this - but it is configurable.

Dataframe1 looks like

    Name
1   Tim
2   Ted
3   Sally
4   Dick
5   Ethel


And Dataframe2 looks like

Name
1   Tam
2   Tid
3   Sally
4   Dicky
5   Aardvark


So running it produces the matches of 

['Tid', 'Tid', 'Sally', 'Dicky', ' ']


Hope this helps.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  渐次进展        
                
              
                            
                2020-12-09 00:24
              
            
            
                                                                       
Several pieces of your code can be greatly simplified by using process.extractOne() from FuzzyWuzzy.  Not only does it just return the top match, you can set a score threshold for it within the function call, rather than needing to perform a separate logical step, e.g.:

process.extractOne(row, data, score_cutoff = 60)


This function will return a tuple of the highest match plus the accompanying score if it finds a match satisfying the condition.  It will return None otherwise.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  北荒        
                
              
                            
                2020-12-09 00:32
              
            
            
                                                                       
fuzzywuzzy's process.extract() returns the list in reverse sorted order , with the best match coming first.

so to find just the best match, you can set the limit argument as 1 , so that it only returns the best match, and if that is greater than 60 , you can write it to the csv, like you are doing now.

Example -

from fuzzywuzzy import process
## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):

    for found, score, matchrow in process.extract(row, data, limit=1):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            writer.writerow(Digi_Results)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复