What is the point of COLLATIONS for nvarchar (Unicode) columns?

问题

IVe read a lot about this.

Still some questions :

Im not talking about case sensitive here...

If I have a char (ש for example) and he is stored in nvarchar - which can hold anything , Why would I need collation here ?
If I'm "FaceBook" and i need the ability to store all chars from all languages , What is the relationship between the collation and my nvarchar columns ?

Thanks in advance.

回答1:

Storing and representing characters is one thing, and knowing how to sort and compare them is another.

Unicode data, stored in the XML and N-prefixed types in SQL Server, can represent all characters in all languages (for the most part, and that is its goal) with a single character set. So for NCHAR / NVARCHAR data (I am leaving out NTEXT as it shouldn't be used anymore, and XML as it is not affected by Collations), the Collations do not change what characters can be stored. For CHAR and VARCHAR data, the Collations do affect what can be stored as each Collation points to a particular Code Page, which determines what can be stored in values 128 - 255.

Now, while there is a default sort order for all characters, that cannot possibly work across all languages and cultures. There are many languages that share some / many / all characters, but have different rules for how to sort them. For example, the letter "C" comes before the letter "D" in most alphabets that use those letters. In US English, a combination of "C" and "H" (i.e. "CH" as two separate letters) would naturally come before any string starting with a "D". But, in a few languages, the two-letter combination of "CH" is special and sorts after "D":

IF (   N'CH' COLLATE Czech_CI_AI > N'D' COLLATE Czech_CI_AI
   AND N'C'  COLLATE Czech_CI_AI < N'D' COLLATE Czech_CI_AI
   AND N'CI' COLLATE Czech_CI_AI < N'D' COLLATE Czech_CI_AI
   ) PRINT 'Czech_CI_AI';

IF (   N'CH' COLLATE Czech_100_CI_AI > N'D' COLLATE Czech_100_CI_AI
   AND N'C'  COLLATE Czech_100_CI_AI < N'D' COLLATE Czech_100_CI_AI
   AND N'CI' COLLATE Czech_100_CI_AI < N'D' COLLATE Czech_100_CI_AI
   ) PRINT 'Czech_100_CI_AI';

IF (   N'CH' COLLATE Slovak_CI_AI > N'D' COLLATE Slovak_CI_AI
   AND N'C'  COLLATE Slovak_CI_AI < N'D' COLLATE Slovak_CI_AI
   AND N'CI' COLLATE Slovak_CI_AI < N'D' COLLATE Slovak_CI_AI
   ) PRINT 'Slovak_CI_AI';

IF (   N'CH' COLLATE Slovak_CS_AS > N'D' COLLATE Slovak_CS_AS
   AND N'C'  COLLATE Slovak_CS_AS < N'D' COLLATE Slovak_CS_AS
   AND N'CI' COLLATE Slovak_CS_AS < N'D' COLLATE Slovak_CS_AS
   ) PRINT 'Slovak_CS_AS';

IF (   N'CH' COLLATE Latin1_General_100_CI_AS > N'D' COLLATE Latin1_General_100_CI_AS
   AND N'C'  COLLATE Latin1_General_100_CI_AS < N'D' COLLATE Latin1_General_100_CI_AS
   AND N'CI' COLLATE Latin1_General_100_CI_AS < N'D' COLLATE Latin1_General_100_CI_AS
   ) PRINT 'Latin1_General_100_CI_AS'
ELSE PRINT 'Nope!';

Returns:

Czech_CI_AI
Czech_100_CI_AI
Slovak_CI_AI
Slovak_CS_AS
Nope!

To see examples of sorting rules across various cultures, please see: Collation Charts.

Also, in some languages certain letters or combinations of letters equate to other letters in ways that they do not in most other languages. For example, only in Danish does a "å" equate to "aa". But, the "å" does not equate to just a single "a":

IF (N'aa' COLLATE Danish_Greenlandic_100_CI_AI =  N'å' COLLATE Danish_Greenlandic_100_CI_AI
AND N'a'  COLLATE Danish_Greenlandic_100_CI_AI <> N'å' COLLATE Danish_Greenlandic_100_CI_AI
   ) PRINT 'Danish_Greenlandic_100_CI_AI';

IF (   N'aa' COLLATE Danish_Norwegian_CI_AI =  N'å' COLLATE Danish_Norwegian_CI_AI
   AND N'a'  COLLATE Danish_Norwegian_CI_AI <> N'å' COLLATE Danish_Norwegian_CI_AI
   ) PRINT 'Danish_Norwegian_CI_AI';

IF (   N'aa' COLLATE Latin1_General_100_CI_AI =  N'å' COLLATE Latin1_General_100_CI_AI
   AND N'a'  COLLATE Latin1_General_100_CI_AI <> N'å' COLLATE Latin1_General_100_CI_AI
   ) PRINT 'Latin1_General_100_CI_AI'
ELSE PRINT 'Nope!';

Returns:

Danish_Greenlandic_100_CI_AI
Danish_Norwegian_CI_AI
Nope!

This is all highly complex, and I haven't even mentioned handling for right-to-left languages (Hebrew and Arabic), Chinese, Japanese, combining characters, etc.

If you want some deep insight into the rules, check out the Unicode Collation Algorithm (UCA). The examples above are based on examples in that documentation, though I do not believe all of the rules in the UCA have been implemented, especially since the Windows collations (collations not starting with SQL_) are based on Unicode 5.0 or 6.0, depending on the which OS you are using and the version of the .NET Framework that is installed (see SortVersion for details).

So that is what the Collations do. If you want to see all of the Collations that are available, just run the following:

SELECT [name] FROM sys.fn_helpcollations() ORDER BY [name];

回答2:

If you have one char, then there is no order. But if you order for example NAMES of PEOPLE - different special chars in different languages are ordered differently depending on collation.

first a collation may be case sensitive - show all B before b - and second special chars have special rules depending on collation.

The documentation is pretty good on that.

回答3:

I think the original poster is getting confused between CODE PAGES and COLLATIONS.

The "n" in nvarchar/nchar allows you to store text using the unicode number set which is large enough to incorporate all characters in all languages (in principle anyway) with a unique number. This itself isn't related to collations. nvarchar/nchar does not use CODE PAGES to encode/decode the meaning of each character code.

Collations define the sort order of characters and which character variants should be treated as identical. nvarchar/nchar DOES use COLLATIONS to define these distinctions.

来源：https://stackoverflow.com/questions/9756769/what-is-the-point-of-collations-for-nvarchar-unicode-columns

标签

sql-server

sql-server-2008

unicode

collation

nvarchar