Why does CHECKSUM() return the same value for different strings?

Question

The following SQL and its result show that different strings produce the same checksum. Why?

select  str ,
        binary_checksum(str) binary_checksum,
        checksum(str) checksum,
        hashbytes('md5', str) md5
from    ( values ( '2Volvo Director 20'), ( '3Volvo Director 30'), ( '4Volvo Director 40') ) 
        t ( str )

str                binary_checksum checksum    md5
------------------ --------------- ----------- --------------------------------------------
2Volvo Director 20 -1356512636     -383039272  0xB9BD78BCF70FAC36AF14FFF589767278
3Volvo Director 30 -1356512636     -383039272  0xF039462F3D15B162FFCDB6125D290826
4Volvo Director 40 -1356512636     -383039272  0xFAF315CDA6E453CCC09838CFB129EE74

Answer 1:

SQL Server's CHECKSUM() and MD5 are both hash functions. Hashing is a one-way algorithm that takes any number of characters or bytes and returns a fixed number of characters or bytes.

This means that whether your input is a single character or a complete book (War and Peace), you get back a response of the same length. The set of possible inputs is infinite, while the set of possible outputs is finite, so it is inevitable that some different values produce the same hash. This is called a hash collision. Good hash algorithms try to mitigate this by making colliding values hard to find.
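The fixed-length property is easy to observe directly. The following is a minimal sketch; the input strings are arbitrary illustrations and are not taken from the question. It compares the output size for a one-character input and a 1,000-character input.

```sql
-- Output size does not depend on input size.
-- The inputs ('a' and 1,000 repetitions of 'a') are arbitrary examples.
select  datalength(hashbytes('md5', 'a'))                  md5_bytes_short,       -- 16
        datalength(hashbytes('md5', replicate('a', 1000))) md5_bytes_long,        -- 16
        datalength(checksum('a'))                          checksum_bytes_short,  -- 4
        datalength(checksum(replicate('a', 1000)))         checksum_bytes_long    -- 4
```

Because CHECKSUM() returns an int, it has only 2^32 possible results, so collisions are far more likely than with the 128-bit MD5 output.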

But enough theory about hashing. The exact answer to your question is covered in "What is the issue with CHECKSUM()?"

Answer 2:

Most likely your current database uses a CP1 collation. The default used to be SQL_Latin1_General_CP1_CI_AI for SQL Server versions older than 2016 or 2017 (this is from experience; I couldn't confirm it from any official source). That collation has the following description:

Latin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 54 on Code Page 1252 for non-Unicode Data

If you change it to a Unicode-sensitive collation such as Latin1_General_CI_AI, CHECKSUM() returns different values for your strings. The only difference between the two collation descriptions is the Unicode part:

Latin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive

select  str ,
        binary_checksum(str) binary_checksum,
        checksum(str) checksum,
        hashbytes('md5', str) md5
from    ( values ( '2Volvo Director 20' COLLATE Latin1_General_CI_AI), ( '3Volvo Director 30' COLLATE Latin1_General_CI_AI), ( '4Volvo Director 40' COLLATE Latin1_General_CI_AI) ) 
        t ( str )
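To check which collation applies in your own environment, something like the following sketch may help. It assumes the string literals in the question pick up the current database's default collation, and it looks up the two collation names mentioned in this answer.

```sql
-- Collation of the current database (string literals default to it).
select  databasepropertyex(db_name(), 'Collation') database_collation

-- Descriptions of the two collations compared in this answer.
select  name ,
        description
from    sys.fn_helpcollations()
where   name in ( 'SQL_Latin1_General_CP1_CI_AI', 'Latin1_General_CI_AI' )
```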

Using nvarchar also returns different checksums, which confirms that this is a Unicode matter:

select  str ,
        binary_checksum(str) binary_checksum,
        checksum(str) checksum,
        hashbytes('md5', str) md5
from    ( values ( N'2Volvo Director 20'), ( N'3Volvo Director 30'), ( N'4Volvo Director 40') ) 
        t ( str )
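For a side-by-side view, the varchar and nvarchar checksums can be computed in one query. This is only a sketch: the column aliases and the nvarchar(50) length are arbitrary choices, and the actual values depend on the collation in effect, so no expected output is shown.

```sql
-- Compare CHECKSUM() on the same strings as varchar and as nvarchar.
-- Aliases and the nvarchar(50) length are illustrative assumptions.
select  str ,
        checksum(str)                       checksum_varchar,
        checksum(cast(str as nvarchar(50))) checksum_nvarchar
from    ( values ( '2Volvo Director 20'), ( '3Volvo Director 30'), ( '4Volvo Director 40') )
        t ( str )
```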

I couldn't find any source that explains why numbers are treated as Unicode data in this case.

Source: https://stackoverflow.com/questions/41946650/why-checksum-returns-the-same-value-for-different-string

Tags

sql-server

sql-server-2012