String Distance Matrix in Python using pdist

温柔的废话 2020-12-31 20:00

How to calculate Jaro Winkler distance matrix of strings in Python?

I have a large array of hand-entered strings (names and record numbers) and I'm trying to find d

3 Answers
  • 2020-12-31 20:26

    Here's a concise solution that requires neither numpy nor scipy:

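    from Levenshtein import jaro_winkler

    data = ['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']
    # jaro_winkler returns a similarity in [0, 1] (1.0 means identical strings)
    dm = [[jaro_winkler(a, b) for b in data] for a in data]
    # print one row per line, two decimals per entry
    print('\n'.join(''.join(f'{item:6.2f}' for item in row) for row in dm))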
    
      1.00  0.00  0.00  0.00  0.53
      0.00  1.00  0.46  0.93  0.00
      0.00  0.46  1.00  0.46  0.00
      0.00  0.93  0.46  1.00  0.00
      0.53  0.00  0.00  0.00  1.00
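
    The matrix above is symmetric, so half of the metric calls are redundant. A minimal sketch of filling only the upper triangle and mirroring it, in plain Python. `pairwise_matrix` and `char_jaccard` are hypothetical names, and `char_jaccard` is only a stand-in metric (not Jaro-Winkler) so the snippet runs without extra packages; passing `jaro_winkler` from the answer instead should reproduce its numbers. The default diagonal of 1.0 assumes a similarity where identical strings score 1.

```python
from itertools import combinations

def char_jaccard(a, b):
    """Stand-in similarity: Jaccard overlap of the two character sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def pairwise_matrix(items, metric, diagonal=1.0):
    """Fill a square matrix from a symmetric metric, calling it once per pair."""
    n = len(items)
    # Assumption: metric(x, x) equals `diagonal`, so the diagonal is never computed.
    matrix = [[diagonal] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = metric(items[i], items[j])
    return matrix

data = ['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']
sm = pairwise_matrix(data, char_jaccard)
print('\n'.join(''.join(f'{v:6.2f}' for v in row) for row in sm))
```

    For a distance rather than a similarity, pass a metric such as `lambda a, b: 1 - jaro_winkler(a, b)` and set `diagonal=0.0`.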
    
  • 2020-12-31 20:32

    For anyone with a similar problem: one solution I just found is to extract the relevant loop from the pdist function and add a [0] to each jaro_winkler argument, which pulls the string out of its row in the numpy array.

    Example:

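    import numpy as np
    from Levenshtein import jaro_winkler
    from scipy.spatial.distance import squareform

    # fname is the M x 1 array of strings from the question
    X = np.asarray(fname, order='C')
    m, n = X.shape
    # condensed vector with one entry per pair (i, j), i < j
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)

    k = 0
    for i in range(m - 1):          # range replaces Python 2's xrange
        for j in range(i + 1, m):
            # X[i] is a length-1 row; [0] extracts the string itself
            dm[k] = jaro_winkler(X[i][0], X[j][0])
            k += 1

    # expand the condensed vector into a symmetric M x M matrix
    dms = squareform(dm)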
    

    Even though this algorithm works, I'd still like to learn whether there's a "right", computer-sciency way to do this with the pdist function itself. Thanks, and hope this helps someone!
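
    The nested loop fills `dm` pair by pair, so each pair (i, j) with i < j ends up at a fixed offset k in the condensed vector. A small sketch of that mapping in plain Python, where `condensed_index` is a hypothetical helper name, not a scipy function:

```python
def condensed_index(i, j, m):
    """Position of pair (i, j), i < j, in a condensed vector over m items."""
    if not 0 <= i < j < m:
        raise ValueError("expected 0 <= i < j < m")
    # Rows 0..i-1 contribute (m-1) + (m-2) + ... + (m-i) entries before row i starts.
    return m * i - i * (i + 1) // 2 + (j - i - 1)

m = 4
k = 0
for i in range(m - 1):
    for j in range(i + 1, m):
        # The running counter from the nested loop matches the closed form.
        assert condensed_index(i, j, m) == k
        k += 1
print(k)  # total number of pairs, m * (m - 1) // 2
```

    This is handy when only a few entries are needed and expanding the whole vector with squareform would be wasteful.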

  • 2020-12-31 20:41

    You need to wrap the distance function, as demonstrated in the following example with the Levenshtein distance:

    import numpy as np    
    from Levenshtein import distance
    from scipy.spatial.distance import pdist, squareform
    
    # my list of strings
    strings = ["hello","hallo","choco"]
    
    # prepare a 2-dimensional array M x N (M entries (3) with N dimensions (1))
    transformed_strings = np.array(strings).reshape(-1, 1)

    # calculate the condensed distance matrix by wrapping the Levenshtein distance function
    distance_matrix = pdist(transformed_strings, lambda x, y: distance(x[0], y[0]))
    
    # get square matrix
    print(squareform(distance_matrix))
    
    Output:
    [[0. 1. 4.]
     [1. 0. 4.]
     [4. 4. 0.]]
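
    To sanity-check the numbers above without installing anything, here is a minimal pure-Python sketch of the classic dynamic-programming edit distance. `edit_distance` is a hypothetical helper, not the `Levenshtein` package's implementation, and it assumes unit cost for insertions, deletions and substitutions:

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, keeping one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]

strings = ["hello", "hallo", "choco"]
matrix = [[edit_distance(a, b) for b in strings] for a in strings]
for row in matrix:
    print(row)
```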
    