Most efficient way to calculate Levenshtein distance

后端 未结 6 438
傲寒
傲寒 2020-11-28 08:37

I just implemented a best match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found out that the overwhelming major

6条回答
  •  误落风尘
    2020-11-28 09:28

    I modified the Levenshtein distance VBA function found on this post to use a one dimensional array. It performs much faster.

    'Calculate the Levenshtein Distance between two strings (the number of insertions,
    'deletions, and substitutions needed to transform the first string into the second)
    
    Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
    Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
    Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
    Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
    Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2
    
    L1 = Len(s1): L2 = Len(s2)
    L1p1 = L1 + 1
    L1p2 = L1 + 2
    LD = (((L1 + 1) * (L2 + 1))) - 1
    ReDim D(0 To LD)
    ss2 = L1 + 1
    
    For i = 0 To L1 Step 1: D(i) = i: Next i                'setup array positions 0,1,2,3,4,...
    For j = 0 To LD Step ss2: D(j) = j / ss2: Next j        'setup array positions 0,1,2,3,4,...
    
    For j = 1 To L2
        ssL = (L1 + 1) * j
        For i = (ssL + 1) To (ssL + L1)
            If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
            cI = D(i - 1) + 1
            cD = D(i - L1p1) + 1
            cS = D(i - L1p2) + cost
    
            If cI <= cD Then 'Insertion or Substitution
                If cI <= cS Then D(i) = cI Else D(i) = cS
            Else 'Deletion or Substitution
                If cD <= cS Then D(i) = cD Else D(i) = cS
            End If
        Next i
    Next j
    
    LevenshteinDistance2 = D(LD)
    End Function
    

    I have tested this function with string 's1' of length 11,304 and 's2' of length 5,665 ( > 64 million character comparisons). With the above single dimension version of the function, the execution time is ~24 seconds on my machine. The original two dimensional function that I referenced in the link above requires ~37 seconds for the same strings. I have optimized the single dimensional function further as shown below and it requires ~10 seconds for the same strings.

    'Calculate the Levenshtein Distance between two strings (the number of insertions,
    'deletions, and substitutions needed to transform the first string into the second)
    Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
    Dim L1 As Long, L2 As Long, D() As Long, LD As Long         'Length of input strings and distance matrix
    Dim i As Long, j As Long, ss2 As Long                       'loop counters, loop step
    Dim ssL As Long, cost As Long                               'loop start, and cost of substitution for current letter
    Dim cI As Long, cD As Long, cS As Long                      'cost of next Insertion, Deletion and Substitution
    Dim L1p1 As Long, L1p2 As Long                              'Length of S1 + 1, Length of S1 + 2
    Dim sss1() As String, sss2() As String                      'Character arrays for string S1 & S2
    
    L1 = Len(s1): L2 = Len(s2)
    L1p1 = L1 + 1
    L1p2 = L1 + 2
    LD = (((L1 + 1) * (L2 + 1))) - 1
    ReDim D(0 To LD)
    ss2 = L1 + 1
    
    For i = 0 To L1 Step 1: D(i) = i: Next i                    'setup array positions 0,1,2,3,4,...
    For j = 0 To LD Step ss2: D(j) = j / ss2: Next j            'setup array positions 0,1,2,3,4,...
    
    ReDim sss1(1 To L1)                                         'Size character array S1
    ReDim sss2(1 To L2)                                         'Size character array S2
    For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i    'Fill S1 character array
    For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i    'Fill S2 character array
    
    For j = 1 To L2
        ssL = (L1 + 1) * j
        For i = (ssL + 1) To (ssL + L1)
            If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
            cI = D(i - 1) + 1
            cD = D(i - L1p1) + 1
            cS = D(i - L1p2) + cost
            If cI <= cD Then 'Insertion or Substitution
                If cI <= cS Then D(i) = cI Else D(i) = cS
            Else 'Deletion or Substitution
                If cD <= cS Then D(i) = cD Else D(i) = cS
            End If
        Next i
    Next j
    
    LevenshteinDistance = D(LD)
    End Function
    

提交回复
热议问题