What is normalized UTF-8 all about?

没有蜡笔的小新 2020-11-29 15:26

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching.

Ho

7 Answers
  •  粉色の甜心
    2020-11-29 15:54

    This is actually fairly simple. Unicode, and therefore UTF-8, has several different representations of the same "character". (I put "character" in quotes because the representations differ byte-wise, but in practice they are the same.) An example is given in the linked document.

    The character "Ç" can be represented as the byte sequence 0xc387 (the single code point U+00C7). But it can also be represented by a C (0x43) followed by the byte sequence 0xcca7 (U+0327, the combining cedilla). So you can say that 0xc387 and 0x43cca7 are the same character. This works because 0xcca7 is a combining mark; that is to say, it takes the character before it (a C here) and modifies it.
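
As a rough illustration of the two byte sequences above, here is a minimal Python sketch using the standard `unicodedata` module (not the ICU/PHP library from the question). The helper name `nfc_equal` is made up for this example.

```python
import unicodedata

precomposed = "\u00c7"   # "Ç" as a single code point
decomposed = "C\u0327"   # "C" followed by COMBINING CEDILLA


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed.encode("utf-8").hex())    # c387
print(decomposed.encode("utf-8").hex())     # 43cca7
print(precomposed == decomposed)            # False: the raw strings differ
print(nfc_equal(precomposed, decomposed))   # True: same character once normalized
```

Normalizing both sides to the same form before comparing is what makes a search treat the two spellings as one value.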

    Now, to understand the difference between canonical equivalence and compatibility equivalence, we need to look at characters in general.

    There are 2 types of characters: those that convey meaning through their value, and those that take another character and alter it. 9 is a meaningful character. A superscript ⁹ takes that meaning and alters its presentation. So canonically they have different meanings, but they still represent the same base character.

    Canonical equivalence is where the byte sequences render the same character with the same meaning. Compatibility equivalence is where the byte sequences render different characters with the same base meaning (even though it may be altered). The 9 and ⁹ are compatibility equivalent since they both mean "9", but they are not canonically equivalent since they don't have the same representation.
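
The 9 versus ⁹ distinction can be sketched the same way. This is only an illustration using Python's standard `unicodedata` module; the variable names are made up for the example. NFC applies canonical equivalence only, while NFKC also folds compatibility equivalents.

```python
import unicodedata

plain = "9"
superscript = "\u2079"   # "⁹" SUPERSCRIPT NINE

# Canonical normalization leaves the superscript alone.
canonical = unicodedata.normalize("NFC", superscript)

# Compatibility normalization folds it down to the base digit.
compat = unicodedata.normalize("NFKC", superscript)

print(canonical == plain)   # False: not canonically equivalent
print(compat == plain)      # True: compatibility equivalent
```

Which form to use depends on the search: compatibility forms match more loosely, at the cost of discarding the presentation difference.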
