Pandas replace strings with fuzzy match in the same column

吃可爱长大的小学妹 submitted on 2021-01-29 13:27:28

Question


I have a column in a dataframe that is like this:

 OWNER
 --------------
 OTTO J MAYER
 OTTO MAYER 
 DANIEL J ROSEN
 DANIEL ROSSY
 LISA CULLI
 LISA CULLY 
 LISA CULLY
 CITY OF BELMONT
 CITY OF BELMONT CITY

Some of the names in my data frame are misspelled or have extra or missing characters. I need a column where each name is replaced by a close match from the same column, so that all similar names are grouped under one single name.

For example, this is what I expect from the data frame above:

 NAME
 --------------
 OTTO J MAYER
 OTTO J MAYER 
 DANIEL J ROSEN
 DANIEL ROSSY
 LISA CULLY
 LISA CULLY 
 LISA CULLY
 CITY OF BELMONT
 CITY OF BELMONT

OTTO MAYER is replaced with OTTO J MAYER because the two are very similar. The DANIELs stay the same because they do not match closely enough. The LISA CULLY variants all end up with the same value, and so on.
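To make "very similar" measurable, one option is the similarity ratio from Python's standard difflib module. The sketch below is only an illustration: the helper name similarity and the choice of pairs are assumptions, not part of the original post.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return difflib's similarity ratio for two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


# Pairs taken from the OWNER column above.
pairs = [
    ("OTTO J MAYER", "OTTO MAYER"),
    ("DANIEL J ROSEN", "DANIEL ROSSY"),
    ("LISA CULLI", "LISA CULLY"),
    ("CITY OF BELMONT", "CITY OF BELMONT CITY"),
]

scores = {pair: similarity(*pair) for pair in pairs}
for (a, b), score in scores.items():
    print(f"{a!r} vs {b!r}: {score:.2f}")
```

With these inputs the two DANIEL names score below 0.8 while the other three pairs score above it, so a threshold near 0.8 separates the groups in the way described.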

I have some code from another Stack Overflow post that tried to solve something similar, but it uses a dictionary of names. I'm having trouble reworking that code to produce the output I need.

Here is what I have currently:

import pandas as pd

d = pd.DataFrame({'OWNER' : pd.Series(['OTTO J MAYER', 'OTTO MAYER','DANIEL J ROSEN','DANIEL ROSSY',
                                      'LISA CULLI', 'LISA CULLY'])})
names = d['OWNER']
names = names.values
names

import difflib 


def best_match(tokens, names):
    for i,t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None

def fuzzy_replace(x, y):

    names = y # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        return u" ".join(tokens)
    return x

d["OWNER"].apply(lambda x: fuzzy_replace(x, names))


Answer 1:


Indeed, difflib.get_close_matches is fit for the task, but splitting the name into tokens does no good. To differentiate the names as specified, we have to raise the cutoff score from its default of 0.6 to about 0.8, and to make sure that all possible names are returned, raise the maximum number of matches from its default of 3 to len(names). Then we have two cases to decide which name to prefer:

  • If a name occurs more often than the others, choose that one.
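  • Otherwise choose the one occurring first.

def fuzzy_replace(x, names):
    # every name at least 80% similar to x, best matches first
    aliases = difflib.get_close_matches(x, names, len(names), .8)
    # prefer the most frequent alias; fall back to the best match
    closest = pd.Series(aliases).mode()
    closest = aliases[0] if closest.empty else closest[0]
    # assign the result back instead of relying on a positional inplace flag
    d['OWNER'] = d['OWNER'].replace(aliases, closest)

for x in d["OWNER"]: fuzzy_replace(x, d['OWNER'])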
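
A variant that leaves OWNER untouched and writes the result to a new NAME column could look like the sketch below. It is not the answer's code: the helper canonical_names and its cutoff parameter are hypothetical, and the two preference rules are spelled out with collections.Counter instead of Series.mode(), which lists tied values in sorted order rather than in order of appearance.

```python
import difflib
from collections import Counter

import pandas as pd


def canonical_names(names, cutoff=0.8):
    """Map each distinct name to one representative of its close matches.

    Hypothetical helper: the most frequent close match wins, and ties
    go to the name that appears first in `names`.
    """
    counts = Counter(names)
    first_seen = {}
    for i, name in enumerate(names):
        first_seen.setdefault(name, i)
    unique = list(first_seen)  # distinct names in order of first appearance

    mapping = {}
    for name in unique:
        # Every distinct name at least `cutoff` similar to this one;
        # the name itself is always included.
        aliases = difflib.get_close_matches(name, unique, n=len(unique), cutoff=cutoff)
        mapping[name] = min(aliases, key=lambda a: (-counts[a], first_seen[a]))
    return mapping


d = pd.DataFrame({"OWNER": [
    "OTTO J MAYER", "OTTO MAYER ", "DANIEL J ROSEN", "DANIEL ROSSY",
    "LISA CULLI", "LISA CULLY ", "LISA CULLY",
    "CITY OF BELMONT", "CITY OF BELMONT CITY",
]})

owner = d["OWNER"].str.strip()  # drop stray trailing spaces before comparing
d["NAME"] = owner.map(canonical_names(owner))
print(d)
```

Each name is compared only with its own close matches in a single pass, so a long chain of pairwise-similar names is not guaranteed to collapse into one group.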


Source: https://stackoverflow.com/questions/58904764/pandas-replace-strings-with-fuzzy-match-in-the-same-column
