T-SQL: get the percentage of character match between two strings

爱一瞬间的悲伤 2020-12-03 03:11

Let's say I have a set of 2 words:

Alexander and Alecsander OR Alexander and Alegzander

Alexander and Aleaxnder, or any other combination. In general we a

2 answers
  • 2020-12-03 04:05

    Ultimately, you appear to be looking to solve for the likelihood that two strings are a "fuzzy" match to one another.

    SQL Server provides efficient, optimized built-in functions that will do that for you, and likely with better performance than what you have written. The two functions you are looking for are SOUNDEX and DIFFERENCE.

    While neither of them solves exactly what you asked for - i.e. they do not return a percentage match - I believe they solve what you are ultimately trying to achieve.

    SOUNDEX returns a 4-character code: the first letter of the word followed by a three-digit number that represents the sound pattern of the word. Consider the following:

    SELECT SOUNDEX('Alexander')
    SELECT SOUNDEX('Alegzander')
    SELECT SOUNDEX('Owleksanndurr')
    SELECT SOUNDEX('Ulikkksonnnderrr')
    SELECT SOUNDEX('Jones')
    
    /* Results:
    
    A425
    A425
    O425
    U425
    J520
    
    */
    

    What you will notice is that the three-digit number 425 is the same for all of the names that sound roughly alike, even where the leading letter differs. So you could match on those digits and say "You typed 'Owleksanndurr', did you perhaps mean 'Alexander'?"
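
    The digit comparison described above can be sketched as follows. This is only an illustration: the inline list of names stands in for a real table, and the alias n, the column Name, and the variable @typed are made up for the example.

    ```sql
    -- The misspelled input we want to find a suggestion for
    DECLARE @typed VARCHAR(100) = 'Owleksanndurr';

    -- Compare only the three digits, ignoring the leading letter,
    -- so that O425 can still be matched with A425
    SELECT n.Name
    FROM (VALUES ('Alexander'), ('Jones')) AS n(Name)
    WHERE RIGHT(SOUNDEX(n.Name), 3) = RIGHT(SOUNDEX(@typed), 3);
    ```

    Given the codes listed above (A425, O425, J520), this should keep 'Alexander' and drop 'Jones'.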

    In addition, there's the DIFFERENCE function, which compares the SOUNDEX codes of two strings and scores how similar they are.

    SELECT DIFFERENCE(  'Alexander','Alexsander')
    SELECT DIFFERENCE(  'Alexander','Owleksanndurr')
    SELECT DIFFERENCE(  'Alexander', 'Jones')
    SELECT DIFFERENCE(  'Alexander','ekdfgaskfalsdfkljasdfl;jl;asdj;a')
    
    /* Results:
    
    4
    3
    1
    1     
    
    */
    

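    As you can see, the higher the score (between 0 and 4), the more likely the strings are a match: 4 means the SOUNDEX codes are the same or nearly so, 0 means they have little or nothing in common.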
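
    DIFFERENCE can also go directly in a WHERE clause. A minimal sketch, assuming an inline list of names in place of a real table (the alias n and the column Name are made up for illustration); it keeps only rows that score 3 or 4 against 'Alexander':

    ```sql
    -- Keep candidates whose SOUNDEX code is close to that of 'Alexander'
    SELECT n.Name,
           DIFFERENCE(n.Name, 'Alexander') AS Score
    FROM (VALUES ('Alexsander'), ('Owleksanndurr'), ('Jones')) AS n(Name)
    WHERE DIFFERENCE(n.Name, 'Alexander') >= 3
    ORDER BY Score DESC;
    ```

    The threshold of 3 is a judgment call; going by the sample scores above, it would keep 'Alexsander' and 'Owleksanndurr' and drop 'Jones'.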

    The advantage of SOUNDEX over DIFFERENCE is that if you really need to do frequent fuzzy matching, you can store and index the SOUNDEX data in a separate (indexable) column, whereas DIFFERENCE can only calculate the SOUNDEX at the time of comparison.
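
    One way to store and index the SOUNDEX value is a persisted computed column. The sketch below is an assumption about how that might look, not part of the original setup: the table dbo.People, its columns, and the index name are all hypothetical.

    ```sql
    -- Hypothetical table; FirstNameSoundex is computed once per row and stored
    CREATE TABLE dbo.People
    (
        PersonId         INT IDENTITY(1,1) PRIMARY KEY,
        FirstName        VARCHAR(100) NOT NULL,
        FirstNameSoundex AS SOUNDEX(FirstName) PERSISTED
    );

    -- Index the stored code so lookups do not have to scan the table
    CREATE INDEX IX_People_FirstNameSoundex
        ON dbo.People (FirstNameSoundex);

    -- Look up rows whose stored code equals the code of the search term
    SELECT PersonId, FirstName
    FROM dbo.People
    WHERE FirstNameSoundex = SOUNDEX('Alegzander');
    ```

    Note that equality on the full code also compares the leading letter, so this lookup would find 'Alexander' (A425) for 'Alegzander' but not for 'Owleksanndurr' (O425).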

  • 2020-12-03 04:06

    Ok, here is my solution so far:

    SELECT  [dbo].[GetPercentageOfTwoStringMatching]('valentin123456'  ,'valnetin123456')
    

    returns 86 (that is, an 86% match)
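
    To see where the 86 comes from: the two strings differ only by the swapped 'e' and 'n', which the distance function below counts as 2 edits, and the longer string has 14 characters. A small sketch of the same integer arithmetic, with the two values hard-coded (the variable names are made up for the example):

    ```sql
    DECLARE @distance  INT = 2,   -- edits between 'valentin123456' and 'valnetin123456'
            @maxLength INT = 14;  -- length of the longer string

    -- Integer division truncates: 2 * 100 / 14 = 14, and 100 - 14 = 86
    SELECT 100 - (@distance * 100 / @maxLength) AS PercentMatch;
    ```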

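    -- Returns how closely two strings match as a whole-number percentage (0-100).
    -- Create [dbo].[LEVENSHTEIN] (below) first, since this function calls it.
    CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching]
    (
        @string1 NVARCHAR(100)
        ,@string2 NVARCHAR(100)
    )
    RETURNS INT
    AS
    BEGIN
    
        DECLARE @levenShteinNumber INT
    
        -- Note: LEN ignores trailing spaces
        DECLARE @string1Length INT = LEN(@string1)
        , @string2Length INT = LEN(@string2)
        DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END
    
        -- Two empty strings match completely; this also avoids dividing by zero below
        IF @maxLengthNumber = 0
            RETURN 100
    
        SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] (   @string1  ,@string2)
    
        -- Integer division, so the share of differing characters is truncated to a whole percent
        DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber
    
        DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters
    
        -- Return the result of the function
        RETURN @percentageOfGoodCharacters
    
    END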
    
    
    
    
    -- =============================================     
    -- Create date: 2011.12.14
    -- Description: http://blog.sendreallybigfiles.com/2009/06/improved-t-sql-levenshtein-distance.html
    -- =============================================
    
    CREATE FUNCTION [dbo].[LEVENSHTEIN](@left  VARCHAR(100),
                                        @right VARCHAR(100))
    returns INT
    AS
      BEGIN
          DECLARE @difference    INT,
                  @lenRight      INT,
                  @lenLeft       INT,
                  @leftIndex     INT,
                  @rightIndex    INT,
                  @left_char     CHAR(1),
                  @right_char    CHAR(1),
                  @compareLength INT
    
          SET @lenLeft = LEN(@left)
          SET @lenRight = LEN(@right)
          SET @difference = 0
    
          IF @lenLeft = 0
            BEGIN
                SET @difference = @lenRight
    
                GOTO done
            END
    
          IF @lenRight = 0
            BEGIN
                SET @difference = @lenLeft
    
                GOTO done
            END
    
          GOTO comparison
    
          COMPARISON:
    
          IF ( @lenLeft >= @lenRight )
            SET @compareLength = @lenLeft
          ELSE
            SET @compareLength = @lenRight
    
          SET @rightIndex = 1
          SET @leftIndex = 1
    
          WHILE @leftIndex <= @compareLength
            BEGIN
                SET @left_char = substring(@left, @leftIndex, 1)
                SET @right_char = substring(@right, @rightIndex, 1)
    
                IF @left_char <> @right_char
                  BEGIN -- Would an insertion make them re-align?
                      IF( @left_char = substring(@right, @rightIndex + 1, 1) )
                        SET @rightIndex = @rightIndex + 1
                      -- Would a deletion make them re-align?
                      ELSE IF( substring(@left, @leftIndex + 1, 1) = @right_char )
                        SET @leftIndex = @leftIndex + 1
    
                      SET @difference = @difference + 1
                  END
    
                SET @leftIndex = @leftIndex + 1
                SET @rightIndex = @rightIndex + 1
            END
    
          GOTO done
    
          DONE:
    
          RETURN @difference
      END 
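
    Once both functions exist, they can be tried on the names from the question. A short usage sketch; tracing the loop above by hand gives a distance of 2 for each pair, so the expected values in the comments follow from the same integer arithmetic as the 86 example:

    ```sql
    -- Distance 2 over a maximum length of 10: 100 - (2 * 100 / 10) = 80
    SELECT [dbo].[GetPercentageOfTwoStringMatching]('Alexander', 'Alecsander') AS PercentMatch;
    SELECT [dbo].[GetPercentageOfTwoStringMatching]('Alexander', 'Alegzander') AS PercentMatch;

    -- Distance 2 over a maximum length of 9: 100 - (2 * 100 / 9) = 100 - 22 = 78
    SELECT [dbo].[GetPercentageOfTwoStringMatching]('Alexander', 'Aleaxnder') AS PercentMatch;
    ```

    Keep in mind that this function walks both strings once and realigns greedily, so it is an approximation and is not guaranteed to return the minimal Levenshtein distance for every input.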
    