Levenshtein distance in T-SQL

Happy的楠姐 2020-11-22 06:30

I am looking for a T-SQL algorithm that calculates the Levenshtein distance.

6 answers
  •  执笔经年
    2020-11-22 07:08

    I was looking for a code example of the Levenshtein algorithm, too, and was happy to find one here. Of course I wanted to understand how the algorithm works, so I played around a little with one of the above examples, the one posted by Veve. To get a better understanding of the code, I created an Excel sheet with the matrix.

    [Image: distance matrix for FUZZY compared with FUZY]

    A picture says more than a thousand words.

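    With this Excel sheet I found that there was potential for additional performance optimization. In this matrix, the values in the upper-right red area do not need to be calculated: the value of each red cell is the value of the cell to its left plus 1. This is because in that area the prefix of the second string is always longer than the prefix of the first one, which increases the distance by 1 for each additional character. Note that this holds for the FUZZY/FUZY matrix but not for every pair of strings: when a later character of the second string matches, the true value can be smaller than left + 1 (for example, the distance between 'a' and 'bba' is 2, but the shortcut yields 3), so for such inputs the function below returns an upper bound rather than the exact distance.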

    You can reflect that by guarding the calculation with the statement IF @j <= @i and incrementing @i before this statement.
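    The matrix described above can also be reproduced directly in T-SQL. The following is only a sketch under stated assumptions: the table variable @m, the helper variables @left, @up, @diag and @d, and the guard_region column are hypothetical names chosen for illustration, and the script fills the full matrix without any shortcut. It lists every cell and marks the cells that the guard IF @j <= @i would skip (with the early increment of @i, those are the cells where j > i + 1), so they can be compared with their left neighbour.

```sql
-- Sketch: build the full Levenshtein matrix for two sample strings.
DECLARE @s1 nvarchar(100) = N'FUZZY';   -- rows    (index i)
DECLARE @s2 nvarchar(100) = N'FUZY';    -- columns (index j)
DECLARE @m  TABLE (i int NOT NULL, j int NOT NULL, d int NOT NULL, PRIMARY KEY (i, j));
DECLARE @i int = 0, @j int, @left int, @up int, @diag int, @d int;

WHILE @i <= LEN(@s1)
BEGIN
    SET @j = 0;
    WHILE @j <= LEN(@s2)
    BEGIN
        IF @i = 0 OR @j = 0
        BEGIN
            -- Border cells: distance between a prefix and the empty string.
            SET @d = @i + @j;
        END
        ELSE
        BEGIN
            SELECT @left = d FROM @m WHERE i = @i     AND j = @j - 1;
            SELECT @up   = d FROM @m WHERE i = @i - 1 AND j = @j;
            SELECT @diag = d FROM @m WHERE i = @i - 1 AND j = @j - 1;

            -- Diagonal plus substitution cost, then compare with left + 1 and up + 1.
            SET @d = @diag + CASE WHEN SUBSTRING(@s1, @i, 1) = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END;
            IF @left + 1 < @d SET @d = @left + 1;
            IF @up   + 1 < @d SET @d = @up + 1;
        END;

        INSERT INTO @m (i, j, d) VALUES (@i, @j, @d);
        SET @j += 1;
    END;
    SET @i += 1;
END;

-- One row per cell; guard_region marks what the shortcut would not calculate.
SELECT i, j, d,
       CASE WHEN i >= 1 AND j > i + 1 THEN 'skipped (left + 1)' ELSE 'calculated' END AS guard_region
FROM @m
ORDER BY i, j;
```

    Changing @s1 and @s2 to other pairs makes it easy to see for which inputs the skipped cells really equal left + 1.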

    CREATE FUNCTION [dbo].[f_LevenshteinDistance](@s1 nvarchar(3999), @s2 nvarchar(3999))
        RETURNS int
        AS
        BEGIN
           DECLARE @s1_len  int;
           DECLARE @s2_len  int;
           DECLARE @i       int;
           DECLARE @j       int;
           DECLARE @s1_char nchar;
           DECLARE @c       int;
           DECLARE @c_temp  int;
           DECLARE @cv0     varbinary(8000);  -- row being built, 2 bytes per cell
           DECLARE @cv1     varbinary(8000);  -- previous row, 2 bytes per cell
    
           SELECT
              @s1_len = LEN(@s1),
              @s2_len = LEN(@s2),
              @cv1    = 0x0000  ,
              @j      = 1       ,
              @i      = 1       ,
              @c      = LEN(@s2);  -- result when @s1 is empty
    
           -- Row 0: distance from the empty prefix of @s1 to each prefix of @s2.
           WHILE @j <= @s2_len
              SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1;
    
           WHILE @i <= @s1_len
              BEGIN
                 SELECT
                    @s1_char = SUBSTRING(@s1, @i, 1),
                    @c       = @i                   ,
                    @cv0     = CAST(@i AS binary(2)),
                    @j       = 1;
    
                 -- Incremented early, so the guard below compares @j with row + 1.
                 SET @i = @i + 1;
    
                 WHILE @j <= @s2_len
                    BEGIN
                       SET @c = @c + 1;  -- left cell + 1
    
                       -- Shortcut: cells with @j > @i keep left + 1 and skip both lookups.
                       -- For some inputs this overestimates the true distance.
                       IF @j <= @i
                          BEGIN
                             -- Diagonal cell + substitution cost.
                             SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j - 1, 2) AS int) + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END;
                             IF @c > @c_temp SET @c = @c_temp;
                             -- Cell above + 1.
                             SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j + 1, 2) AS int) + 1;
                             IF @c > @c_temp SET @c = @c_temp;
                          END;
                       SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1;
                    END;
                 SET @cv1 = @cv0;
              END;
           RETURN @c;
        END;
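
    A quick way to try the function once it has been created. This is just a sketch: the column aliases are made up for illustration, and the values in the comments come from tracing the function by hand, not from a test run. The second call is a pair where the guarded cells are not equal to left + 1, so the shortcut reports more than the true distance.

```sql
-- Assumes dbo.f_LevenshteinDistance exists as defined above.
SELECT
    dbo.f_LevenshteinDistance(N'FUZZY', N'FUZY') AS fuzzy_vs_fuzy,  -- 1: one Z removed
    dbo.f_LevenshteinDistance(N'a', N'bba')      AS a_vs_bba;       -- 3 with the guard; the full matrix gives 2
```

    If exact results are needed for arbitrary inputs, the same calls can be compared against a version of the function without the IF @j <= @i guard.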
    
