Question
How to calculate a Levenshtein distance matrix of strings in Python
        str1   str2   str3   str4   ...   strn
str1    0.8    0.4    0.6    0.1    ...   0.2
str2    0.4    0.7    0.5    0.1    ...   0.1
str3    0.6    0.5    0.6    0.1    ...   0.1
str4    0.1    0.1    0.1    0.5    ...   0.6
.       .      .      .      .      ...   .
.       .      .      .      .      ...   .
.       .      .      .      .      ...   .
strn    0.2    0.1    0.1    0.6    ...   0.7
Using a distance function we can calculate the distance between two words. But here I have one list containing n strings. I want to calculate the distance matrix for the whole list, and after that I want to cluster the words.
Answer 1:
Just use scipy.spatial.distance.pdist, which accepts a custom metric function:
Y = pdist(X, levenshtein)
For the levenshtein function you can use the Rosetta Code implementation, as suggested by Tanu.
If you want the full square matrix, just call squareform on the result:
Y = scipy.spatial.distance.squareform(Y)
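The answer leaves out both the metric and the shape of X, so here is a minimal sketch of how the pieces might fit together. It rests on a few assumptions that are not in the answer: levenshtein below is a plain two-row dynamic-programming version written for illustration (not the Rosetta Code listing), string_distance_matrix is a hypothetical helper name, and because pdist expects a 2-D array, the sketch passes row indices and looks the strings up inside the metric.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def levenshtein(a, b):
    """Edit distance between two strings, keeping only two DP rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_distance_matrix(strings):
    """Hypothetical helper: square Levenshtein matrix built through pdist."""
    # Assumption: feed pdist a column of row indices (a 2-D numeric array)
    # and map each index back to its string inside the metric.
    indices = np.arange(len(strings)).reshape(-1, 1)
    condensed = pdist(
        indices,
        lambda u, v: levenshtein(strings[int(u[0])], strings[int(v[0])]),
    )
    # squareform expands the condensed vector into a symmetric matrix
    # with zeros on the diagonal.
    return squareform(condensed)


words = ["analyze", "analyse", "analysis", "analyst"]
print(string_distance_matrix(words))
```

pdist calls the metric once per unordered pair, so each distance is computed only once. The callable still runs in Python, so for long lists a compiled distance function would be noticeably faster.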
Answer 2:
Here is my code:
import numpy as np
from Levenshtein import distance  # third-party Levenshtein package

Target = ['Tree', 'Trip', 'Treasure', 'Nothingtodo']
List1 = Target
List2 = Target

# One row per string in List1, one column per string in List2
Matrix = np.zeros((len(List1), len(List2)), dtype=int)
for i in range(len(List1)):
    for j in range(len(List2)):
        Matrix[i, j] = distance(List1[i], List2[j])

print(Matrix)
Output:
[[ 0  2  4 11]
 [ 2  0  6 10]
 [ 4  6  0 11]
 [11 10 11  0]]
Answer 3:
You could do something like this:
from time import time

import numpy as np
from Levenshtein import distance


def get_distance_matrix(str_list):
    """Construct a Levenshtein distance matrix for a list of strings."""
    n = len(str_list)
    dist_matrix = np.zeros(shape=(n, n))
    t0 = time()
    print("Starting to build distance matrix. This will iterate from 0 till", n)
    for i in range(n):
        print(i)  # progress: index of the row being filled
        for j in range(i + 1, n):
            # The distance is symmetric, so compute each pair once and mirror it.
            # The diagonal stays 0 from np.zeros.
            dist_matrix[i][j] = distance(str_list[i], str_list[j])
            dist_matrix[j][i] = dist_matrix[i][j]
    t1 = time()
    print("took", (t1 - t0), "seconds")
    return dist_matrix


str_list = ["analyze", "analyse", "analysis", "analyst"]
get_distance_matrix(str_list)
Output:
Starting to build distance matrix. This will iterate from 0 till 4
0
1
2
3
took 0.000197887420654 seconds
array([[ 0.,  1.,  3.,  2.],
       [ 1.,  0.,  2.,  1.],
       [ 3.,  2.,  0.,  2.],
       [ 2.,  1.,  2.,  0.]])
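The question's end goal is clustering, which none of the answers cover, so here is a rough sketch of one way to go from this matrix to word groups. Everything about the clustering step is an assumption made for illustration: SciPy's hierarchical clustering, average linkage, and a cut-off of 2.0 edits. The matrix itself is copied from the output above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

words = ["analyze", "analyse", "analysis", "analyst"]
dist_matrix = np.array([
    [0., 1., 3., 2.],
    [1., 0., 2., 1.],
    [3., 2., 0., 2.],
    [2., 1., 2., 0.],
])

# linkage expects the condensed (upper-triangle) form, not the square matrix.
condensed = squareform(dist_matrix)

# Average linkage is an arbitrary choice for this sketch.
tree = linkage(condensed, method="average")

# Cut the dendrogram at an illustrative threshold of 2.0 edits.
labels = fcluster(tree, t=2.0, criterion="distance")

# Group the words by their cluster label.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(int(label), []).append(word)
print(clusters)
```

With this cut-off, analyze, analyse and analyst end up in one cluster and analysis in a cluster of its own. A different linkage method or threshold can change the grouping, so both are worth tuning on real data.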
Source: https://stackoverflow.com/questions/37428973/string-distance-matrix-in-python