Comparing two simple strings in numpy using levenshtein?

Submitted by 大憨熊 on 2019-12-13 02:55:22

Question


I'm going crazy here. I'm on Python 3.5 and PySpark 2.1, using the code from here:

https://www.datacamp.com/community/tutorials/fuzzy-string-python

Here is the function:

import numpy as np
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Populate the first column and first row with the indices of each character of both strings
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s)+len(t)) - distance[row][col]) / (len(s)+len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(distance[row][col])

But when I try the most basic use of this function, as outlined on the site:

a = 'this'
b = 'that'


levenshtein_ratio_and_distance(a,b)

I get this every time:

TypeError                                 Traceback (most recent call last)
<ipython-input-141-dde90cd15731> in <module>()
      4 
      5 
----> 6 levenshtein_ratio_and_distance(a,b)

<ipython-input-136-5d320a0eaf91> in levenshtein_ratio_and_distance(s, t, ratio_calc)
     34             distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
     35                                  distance[row][col-1] + 1,          # Cost of insertions
---> 36                                  distance[row-1][col-1] + cost)     # Cost of substitutions
     37     if ratio_calc == True:
     38         # Computation of the Levenshtein Distance Ratio

TypeError: _() takes 1 positional argument but 3 were given

Can someone please help me understand why I am getting this error? It happens with EVERY Levenshtein implementation I can find. I can't use fuzzywuzzy because I do not have rights to install packages.
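For reference, since installing packages is not an option, a rough similarity score is also available from the standard library. The sketch below uses difflib.SequenceMatcher; the helper name similarity_ratio is hypothetical, and the score is based on matching blocks rather than true Levenshtein distance, so it will not always agree with the function above.

```python
from difflib import SequenceMatcher


def similarity_ratio(s, t):
    """Hypothetical helper: similarity score in [0, 1], standard library only.

    SequenceMatcher.ratio() returns 2*M/T, where M is the number of
    matched characters and T is the combined length of both strings.
    This is NOT the Levenshtein distance, only a comparable ratio.
    """
    return SequenceMatcher(None, s, t).ratio()


a = 'this'
b = 'that'
score = similarity_ratio(a, b)
print(score)
```

For 'this' and 'that' only the leading 'th' matches, so the score works out to 2*2/8.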

EDIT:

I realized this is happening after I import my spark settings:

from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf 
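
A likely explanation, sketched under the assumption that the wildcard import is the culprit: pyspark.sql.functions defines its own min (a column aggregate that takes one argument), so "from pyspark.sql.functions import *" replaces the built-in min in the notebook namespace. The call min(a, b, c) inside the function then reaches the Spark version, which would match the "_() takes 1 positional argument but 3 were given" message. The snippet below reproduces the effect without Spark; the function _ is a hypothetical stand-in, not PySpark's actual code.

```python
import builtins


def _(col):
    """Hypothetical stand-in for a one-argument column function."""
    return "Column<min({})>".format(col)


# Simulate what the wildcard import does to the name `min`.
min = _

try:
    min(1, 2, 3)
except TypeError as exc:
    shadow_error = str(exc)
    print(shadow_error)

# The real built-in is still reachable through the builtins module.
smallest = builtins.min(1, 2, 3)
print(smallest)

# Restore the built-in name for the rest of the session.
min = builtins.min
```

If that is what is happening here, calling builtins.min inside levenshtein_ratio_and_distance, or importing the Spark functions under a namespace (for example "import pyspark.sql.functions as F") instead of using a wildcard, should avoid the clash.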

Source: https://stackoverflow.com/questions/59076746/comparing-two-simple-strings-in-numpy-using-levenshtein
