T-SQL Get percentage of character match of 2 strings

爱一瞬间的悲伤 2020-12-03 03:11

Let's say I have a set of 2 words:

Alexander and Alecsander OR Alexander and Alegzander

Alexander and Aleaxnder, or any other combination. In general we a

2 answers
  •  孤街浪徒
    2020-12-03 04:05

    Ultimately, you appear to be looking to solve for the likelihood that two strings are a "fuzzy" match to one another.

    SQL Server provides efficient, optimized built-in functions that will do that for you, and likely with better performance than what you have written. The two functions you are looking for are SOUNDEX and DIFFERENCE.

    While neither of them solves exactly what you asked for - i.e. they do not return a percentage match - I believe they solve what you are ultimately trying to achieve.

    SOUNDEX returns a four-character code: the first letter of the word followed by three digits that represent the sound pattern of the rest of the word. Consider the following:

    SELECT SOUNDEX('Alexander')
    SELECT SOUNDEX('Alegzander')
    SELECT SOUNDEX('Owleksanndurr')
    SELECT SOUNDEX('Ulikkksonnnderrr')
    SELECT SOUNDEX('Jones')
    
    /* Results:
    
    A425
    A425
    O425
    U425
    J520
    
    */
    

    Notice that the three-digit part, 425, is the same for all of the names that roughly sound alike; only the leading letter differs. So you could match on those digits and say "You typed 'Owleksanndurr', did you perhaps mean 'Alexander'?"
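
    A minimal sketch of that lookup, comparing only the three digits so that O425 still matches A425. The table variable @Names and its contents are made up for illustration and are not part of the original question.

    -- Hypothetical sample data; @Names exists only for this example.
    DECLARE @Names TABLE (Name varchar(50));
    INSERT INTO @Names (Name) VALUES ('Alexander'), ('Jones');

    DECLARE @Input varchar(50) = 'Owleksanndurr';

    -- Compare only the numeric part of the SOUNDEX code, ignoring the first letter.
    SELECT Name AS Suggestion
    FROM @Names
    WHERE RIGHT(SOUNDEX(Name), 3) = RIGHT(SOUNDEX(@Input), 3);

    Given the codes listed above, this should suggest 'Alexander' and skip 'Jones' (J520).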

    In addition, there's the DIFFERENCE function, which compares the SOUNDEX codes of two strings and returns a similarity score.

    SELECT DIFFERENCE(  'Alexander','Alexsander')
    SELECT DIFFERENCE(  'Alexander','Owleksanndurr')
    SELECT DIFFERENCE(  'Alexander', 'Jones')
    SELECT DIFFERENCE(  'Alexander','ekdfgaskfalsdfkljasdfl;jl;asdj;a')
    
    /* Results:
    
    4
    3
    1
    1     
    
    */
    

    As you can see, the higher the score (between 0 and 4), the more likely the strings are a match: 4 means the SOUNDEX codes agree most closely, and 0 means little or no similarity.
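
    Since the question asks for a percentage, one rough option is to scale the DIFFERENCE score, where 4 is the closest match. This is only a sketch: it yields just five possible values (0, 25, 50, 75, 100) and reflects phonetic similarity, not the share of matching characters.

    -- Scale the 0-4 DIFFERENCE score to a coarse percentage.
    SELECT DIFFERENCE('Alexander', 'Alexsander') * 25 AS MatchPercent;     -- score 4 -> 100
    SELECT DIFFERENCE('Alexander', 'Owleksanndurr') * 25 AS MatchPercent;  -- score 3 -> 75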

    The advantage of SOUNDEX over DIFFERENCE is that, if you need to do frequent fuzzy matching, you can store the SOUNDEX code in a separate column and index it, whereas DIFFERENCE has to calculate the SOUNDEX codes at the time of comparison.
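
    One way to set this up is a persisted computed column with an index on it. The table dbo.Customers, its Name column, and the index name below are hypothetical; treat this as a sketch rather than a tested schema.

    -- Hypothetical table dbo.Customers with a Name column.
    ALTER TABLE dbo.Customers
        ADD NameSoundex AS SOUNDEX(Name) PERSISTED;

    CREATE INDEX IX_Customers_NameSoundex
        ON dbo.Customers (NameSoundex);

    -- The lookup compares against the stored code instead of computing SOUNDEX for every row.
    SELECT Name
    FROM dbo.Customers
    WHERE NameSoundex = SOUNDEX('Alegzander');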
