Copy approximately matching rows from one Excel file to another using Python

Question

Hi, I would like to ask how to copy some rows from one Excel file to another. Using Python fuzzy matching or any other feasible method, I hope to match entire rows by their name and copy them into a new Excel file.

Here is the input data from the first Excel file. There are 13 rows (including the header row) and 6 columns in total, as shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| name                                               | no1 | no2 | no3 | no4 | no5 |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Club___Short___Water           | abc | abc | abc | abc | abc |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Short___Water                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long___Land to Short___Water           | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___BB to Penang___AA                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___SD to Penang___SD                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA(N) to Back___Garden(N)     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA___(N) to Back___Garden     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Left___House___Hostel(w) to NothingNow___(w)  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Laksama to Kota_Dun                           | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

Ignoring the first (header) row, I would like Python to recognize rows whose names are approximately similar, copy each such row in full, and paste it into a new Excel file. The comparison should be based on words rather than individual characters: count how many words are the same, and if the share is greater than or equal to a certain threshold (say 50%), the row qualifies for copying.

For example, compare row 2 and row 3: "from Club___Long to Club___Short___Water" is quite similar to "from Club___Long to Short___Water". The first name has 7 words while the second has 6 words. Of the 7 words in the first name, 6 also appear in the second. Therefore 6 / 7 * 100% = 85.71%, which is more than 50%, so Python would consider the rows matched and copy them.
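
The arithmetic above can be reproduced with a small sketch. It assumes that a "word" is any run of letters or digits (so underscores, spaces and parentheses act as separators), that repeated words are counted with multiplicity, and that the denominator is the word count of the longer name. The helper names `words` and `word_similarity` are made up for illustration.

```python
import re
from collections import Counter


def words(name):
    """Split a name into lower-case words; '_', spaces, '(' etc. act as separators."""
    return re.findall(r"[a-z0-9]+", name.lower())


def word_similarity(a, b):
    """Shared words (counted with multiplicity) divided by the longer name's word count."""
    wa, wb = Counter(words(a)), Counter(words(b))
    shared = sum((wa & wb).values())  # Counter & Counter keeps the minimum counts
    longest = max(sum(wa.values()), sum(wb.values()))
    return shared / longest if longest else 0.0


row2 = "from Club___Long to Club___Short___Water"
row3 = "from Club___Long to Short___Water"
score = word_similarity(row2, row3)
print(f"{score:.2%}")  # 85.71%
```

Note that "from" and "to" occur in every name, so two otherwise unrelated rows already share 2 words; with the names in this table that still stays below the 50% threshold.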

For instance, rows 2 to 4 are approximately the same, so Python would recognize them as a match and copy only rows 2 to 4, in full, to a new Excel file named 'new_file_1.xlsx'. The desired output is shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Club___Short___Water           | abc | abc | abc | abc | abc |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Short___Water                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long___Land to Short___Water           | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

The same goes for rows 5 and 6, saved as 'new_file_2.xlsx'. The desired output is shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___BB to Penang___AA                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___SD to Penang___SD                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

The same goes for rows 7 to 9, saved as 'new_file_3.xlsx'. The desired output is shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

The same goes for rows 10 and 11, saved as 'new_file_4.xlsx'. The desired output is shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA(N) to Back___Garden(N)     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA___(N) to Back___Garden     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

Rows 12 and 13 are dissimilar to every other row, so they do not need to be copied; just leave them.
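
One possible way to obtain exactly these groups is a greedy pass over the names, sketched below. It assumes the word-overlap score described earlier (shared words divided by the word count of the longer name) and compares each row only with the first member of every existing group. `word_similarity` and `group_rows` are hypothetical helper names, not part of any library.

```python
import re
from collections import Counter


def word_similarity(a, b):
    """Shared lower-cased words divided by the word count of the longer name."""
    wa = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    wb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    longest = max(sum(wa.values()), sum(wb.values()))
    return sum((wa & wb).values()) / longest if longest else 0.0


def group_rows(names, threshold=0.5):
    """Greedy grouping: a row joins the first group whose first member it resembles."""
    groups = []  # each group is a list of positions into `names`
    for pos, name in enumerate(names):
        for group in groups:
            if word_similarity(names[group[0]], name) >= threshold:
                group.append(pos)
                break
        else:
            groups.append([pos])  # no similar group found, start a new one
    # rows without any partner are left out, as requested for rows 12 and 13
    return [g for g in groups if len(g) > 1]


names = [
    "from Club___Long to Club___Short___Water",
    "from Club___Long to Short___Water",
    "from Club___Long___Land to Short___Water",
    "from Kinabalu___BB to Penang___AA",
    "from Kinabalu___SD to Penang___SD",
    "from Hill___Town to Unknown___Island___Ice",
    "from Hill___Town to Unknown___Island___Ice",
    "from Hill___Town to Unknown___Island___Ice",
    "from Front___House___AA(N) to Back___Garden(N)",
    "from Front___House___AA___(N) to Back___Garden",
    "from Left___House___Hostel(w) to NothingNow___(w)",
    "from Laksama to Kota_Dun",
]
groups = group_rows(names)
print(groups)  # [[0, 1, 2], [3, 4], [5, 6, 7], [8, 9]]
```

The positions are 0-based and skip the header, so position 0 corresponds to row 2 of the table. Because every name contains "from" and "to", a lower threshold could merge unrelated rows; 50% happens to separate this particular data.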

I would appreciate it a lot if anyone could help me out, thanks!

Answer 1:

I created a function to replace near-duplicates, based on fuzzy matching. It substitutes each name with a representative chosen from its closest matches among all the names (at most 3 candidates, with a similarity cutoff of 0.6). Then I create a new column to store these unified names.

import difflib
import re

import pandas as pd

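def similarity_replace(series):
    """Replace each name with a representative of its group of similar names."""
    reverse_map = {}  # cleaned name -> one original name
    diz_map = {}      # original name -> cleaned name
    # Series.iteritems() was removed in pandas 2.0; Series.items() is the replacement
    for i, s in series.items():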

        # strip "from"/"to" as whole words only (so "Town" is not mangled), then keep letters only
        clean_s = re.sub(r'\b(?:from|to)\b', '', s.lower())
        clean_s = re.sub(r'[^a-z]', '', clean_s)

        diz_map[s] = clean_s
        reverse_map[clean_s] = s

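    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        # w always matches itself, so the list is never empty; taking the
        # alphabetically first close match tends to give similar names one shared label
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.6))[0]

    # original name -> cleaned name -> group label -> an original name for that label
    return series.map(diz_map).map(best_match).map(reverse_map)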

df = pd.DataFrame({'name':['from Club___Long to Club___Short___Water','from Club___Long to Short___Water',
                           'from Club___Long___Land to Short___Water','from Kinabalu___BB to Penang___AA',
                           'from Kinabalu___SD to Penang___SD','from Hill___Town to Unknown___Island___Ice',
                           'from Hill___Town to Unknown___Island___Ice','from Hill___Town to Unknown___Island___Ice',
                           'from Front___House___AA(N) to Back___Garden(N)','from Front___House___AA___(N) to Back___Garden',
                           'from Left___House___Hostel(w) to NothingNow___(w)','from Laksama to Kota_Dun'],
                  'no1':['adb','adb','adb','adb','adb','adb','adb','adb','adb','adb','adb','adb']})

df['group_name'] = similarity_replace(df.name)
df
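
How `difflib.get_close_matches` and the `sorted(...)[0]` step lead to a shared label can be checked in isolation. The sketch below uses three cleaned names written out by hand (the strings that the two `re.sub` calls would produce for rows 2, 3 and 13 of the question); it illustrates the mechanism and is not a replacement for the function above.

```python
import difflib

# cleaned names: lower-cased, "from"/"to" removed, letters only
cleaned = ["clublongclubshortwater", "clublongshortwater", "laksamakotadun"]

labels = {}
for w in cleaned:
    # up to 3 candidates whose SequenceMatcher ratio with w is at least 0.6;
    # w itself is always among them with ratio 1.0
    close = difflib.get_close_matches(w, cleaned, n=3, cutoff=0.6)
    labels[w] = sorted(close)[0]  # alphabetically first candidate becomes the label

for w, label in labels.items():
    print(w, "->", label)
```

The two Club names are within the cutoff of each other, so both end up with the label `clublongclubshortwater`, while `laksamakotadun` has no close match other than itself and keeps its own label. This comparison works on characters, not on whole words as the question describes, so borderline cases may be grouped differently.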

We can use this column to group all the rows that are similar and do something with each group:

for i,group in df.groupby('group_name'):

    ### do something ###
    print(group[['name','no1']])
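
For the "do something" step, the question asks for one file per group ('new_file_1.xlsx', 'new_file_2.xlsx', ...) and for rows without a partner to be skipped. A minimal sketch of that step follows, assuming a `group_name` column like the one built above; `split_groups` and `save_groups` are hypothetical helper names, the small `demo` frame is hand-made for illustration, and writing without a header row is an assumption based on the desired output shown in the question.

```python
import pandas as pd


def split_groups(df, key="group_name", min_rows=2):
    """Return {filename: sub-frame} for every group with at least `min_rows` rows."""
    files = {}
    # sort=False keeps the groups in order of first appearance in df
    for _, group in df.groupby(key, sort=False):
        if len(group) < min_rows:
            continue  # a row that matched nothing else is left out
        files[f"new_file_{len(files) + 1}.xlsx"] = group.drop(columns=key)
    return files


def save_groups(files):
    """Write each sub-frame to its own workbook (needs an Excel engine such as openpyxl)."""
    for filename, frame in files.items():
        frame.to_excel(filename, index=False, header=False)


demo = pd.DataFrame({
    "name": ["from Club___Long to Club___Short___Water",
             "from Club___Long to Short___Water",
             "from Laksama to Kota_Dun"],
    "no1": ["adb", "adb", "adb"],
    "group_name": ["from Club___Long to Club___Short___Water",
                   "from Club___Long to Club___Short___Water",
                   "from Laksama to Kota_Dun"],
})
files = split_groups(demo)
print(list(files))  # ['new_file_1.xlsx']
# save_groups(files)  # uncomment to write the .xlsx files to the current directory
```

With real data, `df` would come from something like `pd.read_excel(...)` on the input workbook instead of being typed in by hand.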

Source: https://stackoverflow.com/questions/61874002/copy-approximate-string-matching-from-excel-to-another-excel-file-using-python

Tags

python

pandas

matching

Fuzzy