Edit distance between two pandas columns

Posted by 拜拜、爱过 on 2019-12-23 08:43:30

Question


I have a pandas DataFrame consisting of two columns of strings. I would like to create a third column containing the Edit Distance of the two columns.

from nltk.metrics import edit_distance    
df['edit'] = edit_distance(df['column1'], df['column2'])

For some reason this seems to go into some sort of infinite loop: it remains unresponsive for quite some time, and then I have to terminate it manually.

Any suggestions are welcome.


Answer 1:


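NLTK's edit_distance function compares a single pair of strings. Passing two whole columns makes it treat each column as one long sequence, which is most likely why the call appears to hang. If you want to compute the edit distance between corresponding pairs of strings, apply it separately to each row's strings like this: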

results = df.apply(lambda x: edit_distance(x["column1"], x["column2"]), axis=1)

Or like this (probably a little more efficient), which avoids passing the irrelevant columns of the DataFrame to the function:

results = df.loc[:, ["column1", "column2"]].apply(lambda x: edit_distance(*x), axis=1)

To add the results to your dataframe, you'd use it like this:

df["distance"] = df.loc[:, ["column1","column2"]].apply(lambda x: edit_distance(*x), axis=1)
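For illustration, the row-by-row pairing can also be written with zip instead of apply. The following is only a minimal sketch under a few assumptions: plain lists stand in for df["column1"] and df["column2"], the sample strings are made up, and levenshtein is a hypothetical pure-Python helper standing in for nltk's edit_distance (unit cost for insert, delete, and substitute) so that the example runs without any dependencies.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute.

    Hypothetical stand-in for nltk.metrics.edit_distance.
    """
    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,          # delete ch_a
                curr[j - 1] + 1,      # insert ch_b
                prev[j - 1] + cost,   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Made-up sample data standing in for df["column1"] and df["column2"]
column1 = ["kitten", "flaw", "pandas"]
column2 = ["sitting", "lawn", "pandas"]

# One distance per row: pair the two columns element by element
distances = [levenshtein(a, b) for a, b in zip(column1, column2)]
print(distances)  # [3, 2, 0]
```

With a real DataFrame the same pattern would read df["distance"] = [edit_distance(a, b) for a, b in zip(df["column1"], df["column2"])]. This skips the per-row overhead of apply, though whether the difference is noticeable depends on the size of the data.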


Source: https://stackoverflow.com/questions/42892617/edit-distance-between-two-pandas-columns
