Question
I have a pandas DataFrame consisting of two columns of strings. I would like to create a third column containing the Edit Distance of the two columns.
from nltk.metrics import edit_distance
df['edit'] = edit_distance(df['column1'], df['column2'])
For some reason this seems to go into some sort of infinite loop: it remains unresponsive for quite some time, and then I have to terminate it manually.
Any suggestions are welcome.
Answer 1:
NLTK's edit_distance function compares a single pair of strings. If you want to compute the edit distance between corresponding pairs of strings, apply it separately to each row's strings like this:
results = df.apply(lambda x: edit_distance(x["column1"], x["column2"]), axis=1)
Or like this (probably a little more efficient), to avoid including the irrelevant columns of the dataframe:
results = df.loc[:, ["column1", "column2"]].apply(lambda x: edit_distance(*x), axis=1)
To add the results to your dataframe, you'd use it like this:
df["distance"] = df.loc[:, ["column1","column2"]].apply(lambda x: edit_distance(*x), axis=1)
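To make the "one pair at a time" point concrete, here is a minimal, dependency-free sketch. The `levenshtein` helper is a hypothetical stand-in for nltk's edit_distance (it is not nltk's code), and the two plain lists stand in for the DataFrame columns. It also hints at a likely reason the original call appears to hang: a function like this treats each whole column as one sequence, so it returns a single number and its work grows with the product of the two column lengths.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; insert, delete and substitute each cost 1.

    Hypothetical stand-in for nltk's edit_distance, written only for illustration.
    """
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty prefix takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


# Plain lists standing in for df["column1"] and df["column2"] (made-up sample data).
column1 = ["kitten", "flaw", "same"]
column2 = ["sitting", "lawn", "same"]

# Row-wise: one distance per pair of strings, which is what the new column needs.
distances = [levenshtein(a, b) for a, b in zip(column1, column2)]
print(distances)

# Whole columns: the lists themselves are compared as two sequences of items,
# so the result is one number, and the cost is len(column1) * len(column2) steps.
whole_column_distance = levenshtein(column1, column2)
print(whole_column_distance)
```

With a real DataFrame, the same pairing can be written as a list comprehension over zip(df["column1"], df["column2"]) calling edit_distance on each pair, as an alternative to apply with axis=1.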
Source: https://stackoverflow.com/questions/42892617/edit-distance-between-two-pandas-columns