I am interested in a T-SQL algorithm for calculating the Levenshtein distance.
I was looking for a code example of the Levenshtein algorithm, too, and was happy to find it here. Of course I wanted to understand how the algorithm works, so I played around a little with one of the above examples, the one posted by Veve. To get a better understanding of the code, I created an Excel sheet with the matrix.
distance for FUZZY compared with FUZY
A picture is worth a thousand words.
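With this Excel sheet I found potential for an additional performance optimization: the values in the upper-right red area do not have to be calculated individually. Each red cell is set to the value of the cell to its left plus 1. The reasoning is that in this area the prefix of the second string is always longer than the prefix of the first one, so each additional character increases the distance by 1. Strictly speaking, "left cell plus 1" is only an upper bound for a cell. It matches the real values in the FUZZY/FUZY matrix, but for some inputs (for example 'a' compared with 'xxa') the shortcut yields a result that is higher than the true Levenshtein distance, so compare against the unmodified function if you need exact values.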
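You can implement this with the statement IF @j <= @i, incrementing @i before that statement is reached. Because of the earlier increment, the check effectively covers all columns up to one position to the right of the diagonal.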
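CREATE FUNCTION [dbo].[f_LevenshteinDistance](@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
    DECLARE @s1_len int;
    DECLARE @s2_len int;
    DECLARE @i int;                 -- 1-based position in @s1 (matrix row)
    DECLARE @j int;                 -- 1-based position in @s2 (matrix column)
    DECLARE @s1_char nchar;         -- character of @s1 for the current row
    DECLARE @c int;                 -- value of the current cell
    DECLARE @c_temp int;            -- candidate value for the current cell
    DECLARE @cv0 varbinary(8000);   -- current row, 2 bytes per cell
    DECLARE @cv1 varbinary(8000);   -- previous row, 2 bytes per cell

    SELECT
        @s1_len = LEN(@s1),
        @s2_len = LEN(@s2),
        @cv1 = 0x0000,
        @j = 1,
        @i = 1,
        @c = LEN(@s2);              -- returned unchanged when @s1 is empty

    -- Row 0: distances from the empty prefix of @s1 to each prefix of @s2 (0, 1, 2, ...)
    WHILE @j <= @s2_len
        SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1;

    WHILE @i <= @s1_len
    BEGIN
        SELECT
            @s1_char = SUBSTRING(@s1, @i, 1),
            @c = @i,                        -- column 0 of the current row
            @cv0 = CAST(@i AS binary(2)),
            @j = 1;

        -- Incremented before the inner loop, so the check below means @j <= row + 1
        SET @i = @i + 1;

        WHILE @j <= @s2_len
        BEGIN
            -- Left cell + 1; in the skipped area (@j > @i) this value is kept as-is
            SET @c = @c + 1;

            IF @j <= @i
            BEGIN
                -- Diagonal cell + 0 (characters match) or + 1 (they differ)
                SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j - 1, 2) AS int) + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END;
                IF @c > @c_temp SET @c = @c_temp;

                -- Cell above + 1
                SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j + 1, 2) AS int) + 1;
                IF @c > @c_temp SET @c = @c_temp;
            END;

            -- Append the finished cell to the current row
            SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1;
        END;

        -- The current row becomes the previous row for the next iteration
        SET @cv1 = @cv0;
    END;

    RETURN @c;
END;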
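
To try the function, a minimal sketch along the following lines could be used. It assumes the function above has already been created in the current database. The table variable @pairs and its column names are made up for this illustration, and the expected column holds the standard Levenshtein distance for each pair as a reference.

```sql
-- Hypothetical test harness: @pairs, s1, s2 and expected are illustrative names.
-- Assumes dbo.f_LevenshteinDistance already exists in the current database.
DECLARE @pairs TABLE (s1 nvarchar(50), s2 nvarchar(50), expected int);

INSERT INTO @pairs (s1, s2, expected) VALUES
    (N'FUZZY',  N'FUZY',    1),   -- the pair from the Excel matrix
    (N'kitten', N'sitting', 3),
    (N'a',      N'xxa',     2);   -- two insertions

-- Show the reference distance next to the function result
SELECT s1, s2, expected,
       dbo.f_LevenshteinDistance(s1, s2) AS actual
FROM @pairs;
```

For the first two pairs, tracing the code by hand gives the expected values 1 and 3. For the last pair the hand trace gives 3 rather than 2, because the cell for the third character of the second string lies in the skipped area and is only set to the left cell plus 1. Pairs like this are worth including when the results are compared against an unoptimized version.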