Why does SQL_Latin1_General_CP1_CI_AS sort number-sign before underscore?

Question

Following up on https://stackoverflow.com/a/32233795/14731, I was surprised to discover that:

```
DECLARE @SampleData TABLE (ANSI VARCHAR(50), UTF16 NVARCHAR(50));
INSERT INTO @SampleData (ANSI, UTF16) VALUES 
    ('##MS_PolicyTsqlExecutionLogin##', N'##MS_PolicyTsqlExecutionLogin##'),
    ('_gaia', N'_gaia');

SELECT sd.ANSI AS [ANSI-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.ANSI COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

SELECT sd.UTF16 AS [UTF16-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.UTF16 COLLATE SQL_Latin1_General_CP1_CI_AS ASC;
```

Results in:

```
ANSI-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
##MS_PolicyTsqlExecutionLogin##
_gaia

UTF16-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
##MS_PolicyTsqlExecutionLogin##
_gaia
```

Yet according to "Why doesn't ICU4J match UTF-8 sort order?", the Unicode results are supposed to be in the opposite order. Why is this the case?

Answer 1:

First things first: the linked question -- Why doesn't ICU4J match UTF-8 sort order? -- hasn't been shown to be entirely correct yet ;-).

That related info aside, let's look at the various pieces:

VARCHAR field with COLLATE SQL_Latin1_General_CP1_CI_AS:

This is going to sort primarily based on ASCII values, and in the case of alphabetic characters, will sort and compare based on rules defined in Code Page 1 (a.k.a. Code Page 1252).

The # character is ASCII code 35 while the _ character is ASCII code 95. These are not alphabetic characters, so one would expect # to sort first in ascending (ASC) order, as in your query.
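
As a quick check of the code-point claim, here is a minimal Python sketch. It only demonstrates ordinal (code-point) ordering of the two sample values from the question; it is an illustration, not a description of how SQL Server implements this collation. The variable names are made up for the example.

```python
# Ordinal (code-point) comparison of the two sample values from the question.
samples = ["_gaia", "##MS_PolicyTsqlExecutionLogin##"]

number_sign = ord("#")  # ASCII code of the number sign
underscore = ord("_")   # ASCII code of the underscore

# Python compares str values code point by code point, so for these
# characters the default sort is plain ASCII order.
ordinal_order = sorted(samples)

print(number_sign, underscore)
print(ordinal_order)
```

In a purely ordinal comparison the string starting with # comes first, which matches the VARCHAR result shown in the question.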

NVARCHAR field with COLLATE SQL_Latin1_General_CP1_CI_AS:

This is going to sort according to Unicode rules. There are no code pages in Unicode, BUT there could be cultural differences that override the default sort rules and ordering. AND, to make things even more interesting, both the base rules and the culture/locale-specific overrides can (and do) change over the years. Software vendors are not always that quick to implement new versions of standards; this is no different from browsers implementing different W3C specifications at different points in time. The major updates in SQL Server came with version 2008, which introduced the 100 series of collations. SQL Server 2012 introduced variants of the 90 and 100 series, ending in _SC, to handle supplementary characters (i.e. the rest of the UTF-16 characters beyond the UCS-2 set).
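
For readers unfamiliar with the term: supplementary characters are code points above U+FFFF, which UTF-16 stores as a surrogate pair of two 16-bit code units. A minimal Python sketch of that difference follows. The helper name `utf16_code_units` is hypothetical, and the sketch says nothing about SQL Server internals.

```python
# U+0041 lies in the Basic Multilingual Plane (the UCS-2 range);
# U+1F600 is a supplementary character outside it.
bmp_char = "\u0041"
supplementary_char = "\U0001F600"


def utf16_code_units(text):
    """Return how many 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


print(utf16_code_units(bmp_char))            # a single code unit
print(utf16_code_units(supplementary_char))  # a surrogate pair
```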

Going back to something mentioned a moment ago, each locale / culture can specify overrides of any of the rules (and not just sorting rules). The current version, 28 (released just 4 days ago!!), has the following for the US locale (found at: http://www.unicode.org/repos/cldr/tags/release-27/common/collation/en_US_POSIX.xml ):
```
<collation type="standard">
  <cr>
  <![CDATA[
    &A<*'\u0020'-'/'<*0-'@'<*ABCDEFGHIJKLMNOPQRSTUVWXYZ<*'['-'`'<*abcdefghijklmnopqrstuvwxyz <*'{'-'\u007F'
  ]]>
  </cr>
</collation> 
```
Reading the new syntax isn't super easy, but I don't think they are reordering any of these punctuation characters. And if you go to their Collation Charts and click the fourth link down (starting at the top left), "Punctuation", it certainly does list "_" as coming before all but one character.
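
Because the syntax is hard to read, it can help to expand it mechanically. The Python sketch below assumes that `<*` introduces a list of characters that each sort after the previous one, that a bare `-` denotes a code-point range, and that single quotes merely escape a character. That is my reading of the CLDR rule syntax, not something the XML itself states. The names `RULE`, `TOKEN`, `decode`, and `expand` are made up for the example, and this is a simplified expander, not a full rule parser.

```python
import re

# The rule text from the CDATA section above, kept verbatim in a raw string.
RULE = r"&A<*'\u0020'-'/'<*0-'@'<*ABCDEFGHIJKLMNOPQRSTUVWXYZ<*'['-'`'<*abcdefghijklmnopqrstuvwxyz <*'{'-'\u007F'"

# A token is either a quoted item ('/' or '\u0020') or one bare character.
TOKEN = re.compile(r"'(\\u[0-9A-Fa-f]{4}|[^'])'|(\S)")


def decode(item):
    """Turn a \\uXXXX escape into its character; return anything else as-is."""
    return chr(int(item[2:], 16)) if item.startswith("\\u") else item


def expand(segment):
    """Expand one '<*' segment into its characters (assumed semantics)."""
    chars = []
    range_pending = False
    for quoted, bare in TOKEN.findall(segment):
        if bare == "-":  # an unquoted hyphen is treated as a range operator
            range_pending = True
            continue
        ch = decode(quoted) if quoted else bare
        if range_pending:
            # Add every code point after the previous character, up to ch.
            chars.extend(chr(cp) for cp in range(ord(chars[-1]) + 1, ord(ch) + 1))
            range_pending = False
        else:
            chars.append(ch)
    return chars


segments = RULE.split("<*")[1:]  # drop the leading "&A" reset
tailored_order = [ch for seg in segments for ch in expand(seg)]

print(tailored_order.index("#") < tailored_order.index("_"))
print(tailored_order == [chr(cp) for cp in range(0x20, 0x80)])
```

Under those assumptions the expanded list runs from U+0020 through U+007F in code-point order, which would place # ahead of _ for this particular en_US_POSIX tailoring.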

If we go back a few versions, we find (found at: http://www.unicode.org/repos/cldr/tags/release-23/common/collation/en_US_POSIX.xml ):
```
<collation type="standard">
  <rules>
    <reset>A</reset>
    <pc>!"#$%&amp;'()*+,-./</pc>
    <pc>0123456789:;&lt;=&gt;?@</pc>
    <pc>ABCDEFGHIJKLMNOPQRSTUVWXYZ</pc>
    <pc>[\]^_`</pc>
    <pc>abcdefghijklmnopqrstuvwxyz</pc>
    <pc>{|}~</pc>
  </rules>
</collation> 
```
Here it certainly does look like they reordered the characters, and into the same order as their ASCII values.
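
That impression is easy to verify. The Python sketch below copies the character groups from the release-23 rules above (as plain characters, without XML escaping) and compares their concatenation to the printable ASCII range; the variable names are made up for the example.

```python
# The <pc> groups from the release-23 rules, in document order.
pc_groups = [
    "!\"#$%&'()*+,-./",
    "0123456789:;<=>?@",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "[\\]^_`",
    "abcdefghijklmnopqrstuvwxyz",
    "{|}~",
]

rule_order = "".join(pc_groups)
# Printable ASCII from '!' (0x21) through '~' (0x7E), in code-point order.
ascii_order = "".join(chr(cp) for cp in range(0x21, 0x7F))

print(rule_order == ascii_order)
print(rule_order.index("#") < rule_order.index("_"))
```

Both checks come out true: the groups are exactly the printable ASCII characters in code-point order, so # precedes _ under these rules.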

If you change the URL to point to version 24, that will look just like the current version 28 XML.

According to the release dates listed on the CLDR Releases/Downloads page, version 24 came out in 2013, well after the 100 series of collations was coded.

Answer 2:

It turns out that @一二三 is right about SQL Server not implementing the default Unicode Collation Algorithm rules, but he was wrong about it using a code page for Unicode sorting. https://stackoverflow.com/a/32706510/14731 contains a detailed explanation of how Unicode sorting is really implemented.

Source: https://stackoverflow.com/questions/32705604/why-does-sql-latin1-general-cp1-ci-as-sort-number-sign-before-underscore

Tags

sql-server

sorting

unicode

collation