Which characters count as the same character under a UTF-8 Unicode collation, and what VB.NET function can be used to merge them?

Question

Also, what's the VB.NET function that maps all those different characters to their most standard form?

For example, ToLower would map A and a to the same character, right?

I need the same kind of function for these characters:

German:

ß === s
Ü === u

Greek:

Χιοσ == Χίος

Otherwise, sometimes I insert Χιοσ and later, when I insert Χίος, MySQL complains that the ID already exists.

So I want to create a unique ID that maps all those unusual characters to a single stable form.

Answer 1:

For the encoding side of this, look at String.Normalize. It also has an overload that takes the particular normal form you want to convert the string to, but the default normal form (C) works fine for nearly everyone who wants to "map all those different characters into their most standard form".
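As a minimal sketch of what String.Normalize does, the snippet below compares two spellings of "é": the precomposed character and "e" followed by a combining accent. The module name and the SameAfterNormalizing helper are made up for illustration; only String.Normalize and NormalizationForm come from the framework.

```vbnet
Imports System
Imports System.Text

Module NormalizeDemo
    ' Hypothetical helper: True when both strings are identical
    ' after conversion to the given normal form (form C by default).
    Function SameAfterNormalizing(a As String, b As String,
                                  Optional form As NormalizationForm = NormalizationForm.FormC) As Boolean
        Return String.Equals(a.Normalize(form), b.Normalize(form), StringComparison.Ordinal)
    End Function

    Sub Main()
        Dim composed As String = ChrW(&HE9)            ' U+00E9, precomposed e-acute
        Dim decomposed As String = "e" & ChrW(&H301)   ' U+0065 followed by U+0301

        ' The raw strings differ when compared code unit by code unit.
        Console.WriteLine(String.Equals(composed, decomposed, StringComparison.Ordinal))  ' False

        ' Once both are in the same normal form, they match.
        Console.WriteLine(SameAfterNormalizing(composed, decomposed))                     ' True

        ' Normalize() with no argument uses form C, which recomposes the pair.
        Console.WriteLine(decomposed.Normalize().Length)                                  ' 1
    End Sub
End Module
```

Normalizing every string to one form before comparing or storing it is usually enough to make visually identical inputs compare equal on the .NET side.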

However, things get more complicated once you move into the database and deal with collations.

Unicode normalization never changes character case. It covers only characters that are essentially equivalent: they look the same¹ and mean the same thing. For example, even after normalization,

 Χιοσ != Χίος

The two sigma characters (σ and ς) are considered non-equivalent, so normalization will not merge them. What it does handle is a case like the accented iota: \u1F30 is equivalent to a sequence of two characters, the plain iota (\u03B9) followed by the accent (\u0313).
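The points above can be checked with a short sketch. The module name and the CodePoints helper are hypothetical and exist only to print code points; the code point values are the ones quoted in the answer, plus the two sigmas (U+03C3 and U+03C2).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module GreekNormalizationDemo
    ' Hypothetical helper: lists the UTF-16 code units of a string as "U+XXXX" tokens.
    Function CodePoints(s As String) As String
        Dim parts As New List(Of String)
        For Each c As Char In s
            parts.Add("U+" & AscW(c).ToString("X4"))
        Next
        Return String.Join(" ", parts)
    End Function

    Sub Main()
        ' Form D splits the accented iota into plain iota + combining mark.
        Dim accentedIota As String = ChrW(&H1F30)
        Console.WriteLine(CodePoints(accentedIota.Normalize(NormalizationForm.FormD)))  ' U+03B9 U+0313

        ' Form C puts the pair back together.
        Dim pair As String = ChrW(&H3B9) & ChrW(&H313)
        Console.WriteLine(CodePoints(pair.Normalize(NormalizationForm.FormC)))          ' U+1F30

        ' Sigma and final sigma stay distinct, even under compatibility normalization.
        Dim sigma As String = ChrW(&H3C3)
        Dim finalSigma As String = ChrW(&H3C2)
        Console.WriteLine(String.Equals(sigma.Normalize(NormalizationForm.FormKC),
                                        finalSigma.Normalize(NormalizationForm.FormKC),
                                        StringComparison.Ordinal))                      ' False

        ' Case is left untouched: "A" does not become "a".
        Console.WriteLine(String.Equals("A".Normalize(NormalizationForm.FormKC), "a",
                                        StringComparison.Ordinal))                      ' False
    End Sub
End Module
```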

Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). They are also sensitive to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.

It may still be useful to normalize the strings before storing them, both for searching and for safer subsequent processing; but the particular case-insensitive collation that you use will no longer restrict you in any way.
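If a search-friendly form of the string is wanted in addition to the stored one, one common approach (not something the answer above prescribes) is to decompose, drop the combining marks, and lowercase. The sketch below is a rough illustration under several assumptions: FoldForLookup is a hypothetical helper, the accented iota in Χίος is assumed to be U+03AF, and the explicit mappings for ß and the final sigma are hand-picked to mirror the equivalences listed in the question. It is not guaranteed to match any particular MySQL collation.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text

Module KeyFolding
    ' Hypothetical helper, not part of the .NET framework.
    Function FoldForLookup(input As String) As String
        ' Decompose so that accents become separate combining marks.
        Dim decomposed As String = input.Normalize(NormalizationForm.FormD)

        ' Copy everything except the combining marks (accents, diaereses, breathings).
        Dim sb As New StringBuilder(decomposed.Length)
        For Each c As Char In decomposed
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                sb.Append(c)
            End If
        Next

        Dim folded As String = sb.ToString().ToLowerInvariant()

        ' Assumed extra mappings: these characters have no decomposition,
        ' so normalization alone never merges them.
        folded = folded.Replace(ChrW(&H3C2), ChrW(&H3C3))      ' final sigma -> sigma
        folded = folded.Replace(ChrW(&HDF).ToString(), "s")    ' sharp s -> s, as in the question

        Return folded.Normalize(NormalizationForm.FormC)
    End Function

    Sub Main()
        ' The two spellings from the question, built from code points.
        Dim accented As String = ChrW(&H3A7) & ChrW(&H3AF) & ChrW(&H3BF) & ChrW(&H3C2)
        Dim plain As String = ChrW(&H3A7) & ChrW(&H3B9) & ChrW(&H3BF) & ChrW(&H3C3)

        Console.WriteLine(FoldForLookup(accented) = FoldForLookup(plain))   ' True
        Console.WriteLine(FoldForLookup(ChrW(&HDC).ToString()) = "u")       ' True (U+00DC is U with diaeresis)
        Console.WriteLine(FoldForLookup(ChrW(&HDF).ToString()) = "s")       ' True
    End Sub
End Module
```

The folded value would live in a separate, indexed column used only for lookups, while the original string stays intact as an attribute of the record.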

¹ Almost the same, in the case of compatibility normalization as opposed to canonical normalization.

Source: https://stackoverflow.com/questions/9834300/what-are-the-characters-that-count-as-the-same-character-under-collation-of-utf8

Tags

vb.net

.net-4.0

utf-8

collation

unicode-normalization