String Distance Matrix in Python

家住魔仙堡 submitted on 2019-12-06 01:25:20

Question


How can I calculate a Levenshtein distance matrix for a list of strings in Python?

              str1    str2    str3    str4    ...     strn
      str1    0.8     0.4     0.6     0.1     ...     0.2
      str2    0.4     0.7     0.5     0.1     ...     0.1
      str3    0.6     0.5     0.6     0.1     ...     0.1
      str4    0.1     0.1     0.1     0.5     ...     0.6
      .       .       .       .       .       ...     .
      .       .       .       .       .       ...     .
      .       .       .       .       .       ...     .
      strn    0.2     0.1     0.1     0.6     ...     0.7

Using the distance function we can calculate the distance between two words. But here I have one list containing n strings. I want to calculate the distance matrix, and after that I want to cluster the words.


Answer 1:


Use scipy.spatial.distance.pdist, which accepts a custom metric function.

Y = pdist(X, levenshtein)

For the levenshtein function, you can use the Rosetta Code implementation, as suggested by Tanu.
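
As a rough illustration of what such a metric function might look like, here is a minimal pure-Python sketch of the classic two-row dynamic-programming algorithm. It is not the Rosetta Code listing itself, and the function name levenshtein is just an assumption chosen to match the call above.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

This keeps only two rows of the table in memory, so it needs O(min(len(a), len(b))) space. For large inputs a compiled implementation such as the Levenshtein package is likely to be much faster.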

If you want the full square matrix, apply squareform to the result:

Y = scipy.spatial.distance.squareform(Y)
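
One practical caveat: pdist expects a 2-D array of observations, and passing raw strings to it tends to fail. A common workaround, sketched below, is to pass an array of indices and let the metric look the strings up. The names words, X, and index_metric are hypothetical, and the compact levenshtein here is only a stand-in for whichever string metric you actually use.

```python
from functools import lru_cache

import numpy as np
from scipy.spatial.distance import pdist, squareform


def levenshtein(a, b):
    """Compact memoized edit distance; fine for short strings only."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0 or j == 0:
            return i + j  # one prefix is empty
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))


words = ["Tree", "Trip", "Treasure", "Nothingtodo"]

# One row per string; each row holds that string's index in `words`.
X = np.arange(len(words)).reshape(-1, 1)


def index_metric(u, v):
    """Look the two strings up by index and compare them."""
    return levenshtein(words[int(u[0])], words[int(v[0])])


Y = pdist(X, index_metric)  # condensed form: n*(n-1)/2 distances
D = squareform(Y)           # full symmetric n x n matrix, zero diagonal
print(D)
```

Because pdist evaluates each unordered pair only once, this does about half the work of a full double loop. The condensed vector Y is also the input format that scipy.cluster.hierarchy.linkage accepts, which may be convenient for the clustering step.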



Answer 2:


Here is my code

from Levenshtein import distance
import numpy as np

Target = ['Tree','Trip','Treasure','Nothingtodo']

List1 = Target
List2 = Target

# np.int was removed in NumPy 1.24; the built-in int works on all versions.
Matrix = np.zeros((len(List1), len(List2)), dtype=int)

# Fill every cell with the edit distance between the two strings.
for i in range(len(List1)):
    for j in range(len(List2)):
        Matrix[i, j] = distance(List1[i], List2[j])

print(Matrix)

[[ 0  2  4 11]
 [ 2  0  6 10]
 [ 4  6  0 11]
 [11 10 11  0]]



Answer 3:


You could do something like this

from Levenshtein import distance
import numpy as np
from time import time

def get_distance_matrix(str_list):
    """Construct a Levenshtein distance matrix for a list of strings."""
    n = len(str_list)
    dist_matrix = np.zeros(shape=(n, n))
    t0 = time()
    print("Starting to build distance matrix. This will iterate from 0 till ", n)
    for i in range(n):
        print(i)
        # The distance is symmetric, so compute only the upper triangle
        # and mirror each value; the diagonal stays 0.
        for j in range(i + 1, n):
            dist_matrix[i][j] = distance(str_list[i], str_list[j])
            dist_matrix[j][i] = dist_matrix[i][j]
    t1 = time()
    print("took", (t1 - t0), "seconds")
    return dist_matrix

str_list = ["analyze", "analyse", "analysis", "analyst"]
get_distance_matrix(str_list)

Starting to build distance matrix. This will iterate from 0 till  4
0
1
2
3
took 0.000197887420654 seconds
array([[ 0.,  1.,  3.,  2.],
       [ 1.,  0.,  2.,  1.],
       [ 3.,  2.,  0.,  2.],
       [ 2.,  1.,  2.,  0.]])


Source: https://stackoverflow.com/questions/37428973/string-distance-matrix-in-python
