Why does SQL_Latin1_General_CP1_CI_AS sort number-sign before underscore?

Posted by 本秂侑毒 on 2019-12-08 08:18:27
Solomon Rutzky

First things first: the linked question -- Why doesn't ICU4J match UTF-8 sort order? -- hasn't been shown to be entirely correct yet ;-).

That related info aside, let's look at the various pieces:

  1. VARCHAR field with COLLATE SQL_Latin1_General_CP1_CI_AS:

    This is going to sort primarily based on ASCII values, and in the case of alphabetic characters, will sort and compare based on rules defined in Code Page 1 (a.k.a. Code Page 1252).

    The # character is ASCII code 35 while the _ character is ASCII code 95. Neither is an alphabetic character, so one would expect # to sort first in ASCending order, which is the order you are using here.

  2. NVARCHAR field with COLLATE SQL_Latin1_General_CP1_CI_AS:

    This is going to sort according to Unicode rules. There are no Code Pages in Unicode, BUT there could be cultural differences that override the default sort rules and ordering. AND, to make things even more interesting, both the base rules and the culture/locale-specific overrides can (and do) change over the years. Software vendors are not always that quick to implement new versions of standards. This is no different than various browsers implementing different W3C specifications at different points in time.

    The major updates in SQL Server came with version 2008, which introduced the 100 series of collations. SQL Server 2012 introduced variants of the 90 and 100 series, ending in _SC, to handle supplementary characters (i.e. the rest of the UTF-16 characters beyond the UCS-2 set).

    Going back to something mentioned a moment ago, each locale / culture can specify overrides of any of the rules (and not just sorting rules). The current version, 28 (released just 4 days ago!!), has the following for the US locale (found at: http://www.unicode.org/repos/cldr/tags/release-27/common/collation/en_US_POSIX.xml )

    <collation type="standard">
      <cr>
      <![CDATA[
        &A<*'\u0020'-'/'<*0-'@'<*ABCDEFGHIJKLMNOPQRSTUVWXYZ<*'['-'`'<*abcdefghijklmnopqrstuvwxyz <*'{'-'\u007F'
      ]]>
      </cr>
    </collation> 
    

    Reading the new syntax isn't super-easy, but I don't think they are reordering any of these punctuation characters. And if you go to their Collation Charts and click on the fourth link down (starting at the top left), for "Punctuation", it certainly does list "_" as coming before all but one character.

    If we go back a few versions, we find (found at: http://www.unicode.org/repos/cldr/tags/release-23/common/collation/en_US_POSIX.xml ):

    <collation type="standard">
      <rules>
        <reset>A</reset>
        <pc>!"#$%&'()*+,-./</pc>
        <pc>0123456789:;<=>?@</pc>
        <pc>ABCDEFGHIJKLMNOPQRSTUVWXYZ</pc>
        <pc>[\]^_`</pc>
        <pc>abcdefghijklmnopqrstuvwxyz</pc>
        <pc>{|}~</pc>
      </rules>
    </collation> 
    

    Now here it certainly does look like they reordered these characters, and into the same order as their ASCII values.

    If you change the URL to point to version 24, that will look just like the current version 28 XML.

    According to the release dates listed on the CLDR Releases/Downloads page, version 24 came out in 2013, well after the 100 series of collations was coded.
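As a rough illustration of the two orderings discussed above (plain ASCII code points from point 1, and the release-23 en_US_POSIX rules quoted in point 2), here is a minimal Python sketch. The rule strings are copied from the release-23 XML; the helper names (`tailored_rank`, `sort_by_code_point`, `sort_by_v23_rules`) are hypothetical and exist only for this example. It models nothing more than a single-level, character-by-character ordering, so it is not the full Unicode Collation Algorithm and makes no claim about how SQL Server actually sorts.

```python
# Character groups copied from the <pc> elements of the release-23
# en_US_POSIX.xml quoted above. Everything else here is illustrative.
V23_PC_RULES = [
    "!\"#$%&'()*+,-./",
    "0123456789:;<=>?@",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "[\\]^_`",
    "abcdefghijklmnopqrstuvwxyz",
    "{|}~",
]


def tailored_rank(rules):
    """Map each character to its position in the concatenated rule order."""
    order = "".join(rules)
    return {ch: i for i, ch in enumerate(order)}


RANK = tailored_rank(V23_PC_RULES)


def sort_by_code_point(values):
    """Sort strings by the numeric code point of each character."""
    return sorted(values, key=lambda s: [ord(c) for c in s])


def sort_by_v23_rules(values):
    """Sort strings by each character's position in the release-23 rules.

    Assumption: every character in the input appears in V23_PC_RULES.
    """
    return sorted(values, key=lambda s: [RANK[c] for c in s])


if __name__ == "__main__":
    samples = ["a_b", "a#b", "a1b"]
    print(ord("#"), ord("_"))
    print(sort_by_code_point(samples))
    print(sort_by_v23_rules(samples))
```

Under this simplified model both functions put `a#b` ahead of `a_b`, because the release-23 groups, read end to end, list the printable characters in ASCII order.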

Gili

It turns out that @一二三 is right about SQL Server not implementing the default Unicode Collation Algorithm rules, but he was wrong about it using a code page for Unicode sorting. https://stackoverflow.com/a/32706510/14731 contains a detailed explanation of how Unicode sorting is really implemented.
