Why does the Difference function give different results when switching order of strings to compare?

☆樱花仙子☆ submitted on 2021-02-08 12:55:06

Question


In SQL Server, if I do the following:

Difference ('Kennady', 'Kary') : I get 2

If I do:

Difference ('Kary', 'Kennady') : I get 3.

I thought the DIFFERENCE function compares the SOUNDEX values under the hood and returns a number from 0 to 4 for how many characters match in place.

SELECT SOUNDEX('Kennady') AS [SoundEx Kennady]
    , SOUNDEX('Kary') AS [SoundEx Kary]
    , DIFFERENCE ('Kennady', 'Kary') AS [Difference Kennady vs Kary]
    , DIFFERENCE ('Kary', 'Kennady') AS [Difference Kary vs Kennady];

Answer 1:


This is strictly observational. The documentation is pretty clear:

The integer returned is the number of characters in the SOUNDEX values that are the same. The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.

According to this documentation, the return value should not differ based on the order of the arguments.

From my queries: "Kennady" --> K530 and "Kary" --> K600. These have two characters in common, so the value should be 2.
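
One way to read the documented rule is as a position-by-position count over the two four-character codes. A minimal Python sketch of that reading follows; the function name `positional_matches` is made up for illustration, the codes are the ones quoted above, and this is not a claim about how SQL Server implements DIFFERENCE.

```python
def positional_matches(code_a: str, code_b: str) -> int:
    """Count positions at which two Soundex codes hold the same character.

    Assumption: this is a literal reading of the documentation, not
    SQL Server's actual DIFFERENCE implementation.
    """
    return sum(1 for a, b in zip(code_a, code_b) if a == b)


# Codes quoted in the answer: "Kennady" -> K530, "Kary" -> K600.
print(positional_matches("K530", "K600"))  # 2
print(positional_matches("K600", "K530"))  # 2
```

Under this reading the count is the same in both directions, which is why the asymmetry in the question is surprising.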

Now, I notice that "Kenn" --> K500. Truncating "Kennady" to the length of "Kary" results in the value "3". Hmmm.

Hence, I think that DIFFERENCE() is using the length of the first argument to truncate the second argument. That makes the order of the arguments important. Put the longer argument first.

I tried this out on some other strings, and the same pattern seems to hold. I have not found any documentation that specifies that this is the case.

I suppose Microsoft would call this a "feature" and not a "bug" ;).

EDIT:

The above speculation is not quite correct. Consider the following:

  • leepaupauld --> L114
  • leopold --> L143
  • leepaup --> L110

However,

  • difference(leepaupauld, leopold) = 4 (!)
  • difference(leopold, leepaupauld) = 3
  • difference(leepaup, leopold) = 3 (!)
  • difference(leopold, leepaup) = 2

The (!) is my judgement that the result makes no sense at all, given the soundex values for the strings.

So, the issue isn't the length. It is the underlying method, which @jpw points to in the comment. The problem appears to be duplicate matching values in one string. However, according to the documentation, these should not match the same character multiple times.

My advice: Use Levenshtein distance. It makes sense. It works better on longer strings. It is sane. It is not built in, but it is easy enough to find an implementation on the web for any database.
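
For readers who want to see what that involves, here is a minimal sketch of the standard dynamic-programming Levenshtein distance. It is written in Python rather than T-SQL purely for illustration; the function name is arbitrary, and a database implementation would need to be ported or sourced separately.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Kennady", "Kary"))  # 4
print(levenshtein("Kary", "Kennady"))  # 4
```

Unlike DIFFERENCE, a lower number here means a closer match, and the result does not depend on argument order.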




Answer 2:


An answer by example, comparing 'Bathilda' and 'Bagshott'.

First: 'Bathilda', Soundex B343. Second: 'Bagshott', Soundex B230. The second is searched for in the first. The first match is B. The next search starts after B, at the 3. The '2' returns no match. The second match is the 3 from the second matching the first 3 from the first. Iteration then starts from position 2. The third match is the 3 from the second matching the second 3 from the first. The result is 3.

Reverse: now the first is 'Bagshott' (Soundex B230) and the second is 'Bathilda' (Soundex B343). The first match is again B. Iteration starts from position 2. The second match is the first 3 from the second matching the 3 from the first. No more iterations are done, because the 3 is the last coded letter in the first.

Explanation, from https://msdn.microsoft.com/en-us/library/ms188753.aspx: "DIFFERENCE and SOUNDEX are collation sensitive." I take this to mean that every search starts after the last match and runs to the last character in the sequence. That is why two sequences with the same number of characters and the same characters can give a result less than 4. For example, DIFFERENCE for 'Brts' and 'Btrs' gives 2.




Answer 3:


Reference the last post here that explains the algorithm: https://social.msdn.microsoft.com/Forums/en-US/a6ba987d-6fde-40d3-bcd0-4c7fd3d2e8cf/tsql-difference-function-returns-different-results-for-same-query?forum=transactsql

NOTE: This is all my opinion as to what is happening.

According to that post, it takes the FIRST parameter and steps through it character by character, looking for matches in the second parameter.

As an example, my name "Vogel" = V240 in SOUNDEX. "Vasquez" = V220.

DIFFERENCE('Vogel','Vasquez') = 3

Because it checks "V", "2", "4", and "0" and finds 3 matches.

However,

DIFFERENCE('Vasquez','Vogel') = 4

Because it checks "V", "2", "2", and "0" and finds 4 matches.

It seems that if the SOUNDEX code of the first parameter contains any repeated digits, DIFFERENCE can produce unexpected results.
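
The behaviour described above can be sketched in a few lines. This Python snippet models only this answer's hypothesis (count how many characters of the first code occur anywhere in the second); the function name `membership_count` is invented, the codes are those quoted in this thread, and it is not SQL Server's actual implementation.

```python
def membership_count(first_code: str, second_code: str) -> int:
    """Count characters of first_code that occur anywhere in second_code.

    Assumption: models the hypothesis in this answer only; repeated
    characters in first_code can each match the same character.
    """
    return sum(1 for ch in first_code if ch in second_code)


# "Vogel" -> V240, "Vasquez" -> V220 (codes quoted above).
print(membership_count("V240", "V220"))  # 3: V, 2 and 0 are found, 4 is not
print(membership_count("V220", "V240"))  # 4: V, 2, 2 and 0 are all found
```

With the codes quoted elsewhere in this thread, the same model gives 2 and 3 for K530/K600 and 4 and 3 for L114/L143, matching the reported DIFFERENCE results. Agreement on a handful of examples does not confirm the algorithm.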



Source: https://stackoverflow.com/questions/40347930/why-does-the-difference-function-give-different-results-when-switching-order-of
