Question
The following SQL and its result show that different strings produce the same checksum. Why?
select str,
       binary_checksum(str) as binary_checksum,
       checksum(str) as checksum,
       hashbytes('md5', str) as md5
from ( values ('2Volvo Director 20'),
              ('3Volvo Director 30'),
              ('4Volvo Director 40') ) t (str)
str                binary_checksum checksum    md5
------------------ --------------- ----------- ----------------------------------
2Volvo Director 20 -1356512636     -383039272  0xB9BD78BCF70FAC36AF14FFF589767278
3Volvo Director 30 -1356512636     -383039272  0xF039462F3D15B162FFCDB6125D290826
4Volvo Director 40 -1356512636     -383039272  0xFAF315CDA6E453CCC09838CFB129EE74
Answer 1:
SQL Server's CHECKSUM() and MD5 are both hash functions. Hashing is a one-way algorithm that takes any number of characters or bytes and returns a fixed number of bytes.
This means that whether your input is one character or a complete book (War and Peace), the output has the same length. The set of possible inputs is unbounded while the set of possible outputs is finite, so different values will inevitably share a hash. This is called a hash collision. Good hash algorithms are designed to make such colliding values hard to find.
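As a quick illustration of the fixed output size, here is a minimal T-SQL sketch. The column aliases are made up for this example, and the 8000-character input is just an arbitrary "long" string that stays within the input limit of older HASHBYTES versions.

```sql
-- Output size does not depend on input size.
select datalength(hashbytes('md5', 'a'))                  as md5_bytes_short,  -- 16
       datalength(hashbytes('md5', replicate('a', 8000))) as md5_bytes_long,   -- 16
       datalength(checksum('a'))                          as checksum_bytes;   -- 4 (an int)
```

Since CHECKSUM() returns an int, it can produce at most 2^32 distinct values, so collisions between different strings are unavoidable.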
But enough theory about hashing. The exact answer to your question is here: What is the issue with CHECKSUM()?
Answer 2:
Most probably your current database collation is a CP1 collation. The default used to be SQL_Latin1_General_CP1_CI_AI
for SQL Server versions older than 2016 or 2017 (this is from experience; I couldn't confirm it from any official source). That collation has this description:
Latin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 54 on Code Page 1252 for non-Unicode Data
If you change it to a Unicode-sensitive collation such as Latin1_General_CI_AI,
it returns different checksums for your values. Judging by the descriptions, the only difference between the two collations is the part about Unicode and non-Unicode data:
Latin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
select str,
       binary_checksum(str) as binary_checksum,
       checksum(str) as checksum,
       hashbytes('md5', str) as md5
from ( values ('2Volvo Director 20' COLLATE Latin1_General_CI_AI),
              ('3Volvo Director 30' COLLATE Latin1_General_CI_AI),
              ('4Volvo Director 40' COLLATE Latin1_General_CI_AI) ) t (str)
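To check which collation is actually in effect, and to compare the two descriptions quoted above, something like the following should work. It is only a sketch: the alias database_collation is made up here, and it assumes the string literals pick up the default collation of the current database.

```sql
-- Collation of the current database.
select databasepropertyex(db_name(), 'Collation') as database_collation;

-- Descriptions of the two collations discussed above.
select name, description
from sys.fn_helpcollations()
where name in ('SQL_Latin1_General_CP1_CI_AI', 'Latin1_General_CI_AI');
```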
Using NVARCHAR
also returns different checksums, which confirms that this is a Unicode matter:
select str,
       binary_checksum(str) as binary_checksum,
       checksum(str) as checksum,
       hashbytes('md5', str) as md5
from ( values (N'2Volvo Director 20'),
              (N'3Volvo Director 30'),
              (N'4Volvo Director 40') ) t (str)
I couldn't find any source that explains why the digits are treated as Unicode data in this case.
Source: https://stackoverflow.com/questions/41946650/why-checksum-returns-the-same-value-for-different-string