The ICU project (which PHP can now use via the intl extension) contains the classes needed to normalize UTF-8 strings, making it easier to compare values when searching.
Normal forms (of Unicode, not databases) deal primarily with characters that carry diacritical marks. Unicode provides some characters with "built-in" diacritical marks, such as U+00C0, "Latin Capital A with Grave". The same character can also be built from a "Latin Capital A" (U+0041) followed by a "Combining Grave Accent" (U+0300). That means even though the two sequences produce the same visible character, a byte-by-byte comparison will show them as being completely different.
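To make that concrete, here's a short sketch using Python's standard `unicodedata`-adjacent string handling, purely for illustration (ICU-backed normalizers behave the same way):

```python
# Two ways of writing the same visible character "À".
precomposed = "\u00c0"         # U+00C0: Latin Capital A with Grave
decomposed = "\u0041\u0300"    # U+0041 "A" + U+0300 Combining Grave Accent

# They render identically, but a naive comparison says they differ.
print(precomposed == decomposed)        # False
print(precomposed.encode("utf-8"))      # b'\xc3\x80'
print(decomposed.encode("utf-8"))       # b'A\xcc\x80'
```

The two UTF-8 byte sequences share nothing in common, which is exactly why a raw string or byte comparison fails here.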
Normalization is an attempt at dealing with that. Normalizing ensures (or at least tries to ensure) that all the characters are encoded the same way -- either all using a separate combining diacritical mark where needed, or all using a single code point wherever possible. From the viewpoint of comparison, it doesn't matter a whole lot which you choose -- what matters is that both strings are normalized to the *same* form, at which point they will compare properly.
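A sketch of that idea, using Python's standard `unicodedata` module as a stand-in for ICU: once both strings are run through the same normal form, whether NFC (composed) or NFD (decomposed), they compare equal:

```python
import unicodedata

precomposed = "\u00c0"    # single code point
decomposed = "A\u0300"    # base letter + combining mark

# Pick either form; what matters is applying the same one to both sides.
assert unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", decomposed)
assert unicodedata.normalize("NFD", precomposed) == unicodedata.normalize("NFD", decomposed)
print("equal after normalization")
```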
In this case, "compatibility" means compatibility with code that assumes one code point equals one character. If you have code like that, you probably want to use the compatibility normal form. Although I've never seen it stated directly, the names of the normal forms suggest that the Unicode consortium considers it preferable to use separate combining diacritical marks. Counting the actual characters in a string (as well as things like breaking a string intelligently) then requires more intelligence, but the representation is more versatile.
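The "one code point equals one character" pitfall shows up directly in length counting. In the decomposed representation, a naive length is off by one per combining mark; composing the string first restores the expected count (again illustrated with Python's `unicodedata`):

```python
import unicodedata

s = "A\u0300"    # "À" written as two code points

print(len(s))    # 2 -- counts code points, not visible characters

# After composing, the accented letter is a single code point again.
print(len(unicodedata.normalize("NFC", s)))    # 1
```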
If you're making full use of ICU, chances are that you want to use the canonical normal form. If you're trying to write code on your own that (for example) assumes a code point equals a character, then you probably want the compatibility normal form that makes that true as often as possible.
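The difference between the canonical and compatibility forms is easiest to see with a compatibility character such as the fi ligature. The canonical form (NFC) leaves it alone; the compatibility form (NFKC) folds it into the two plain letters most code expects (sketched with Python's `unicodedata`; ICU exposes the same forms through its normalizer classes):

```python
import unicodedata

ligature = "\ufb01"    # U+FB01 LATIN SMALL LIGATURE FI -- one code point

print(unicodedata.normalize("NFC", ligature))     # unchanged: still the ligature
print(unicodedata.normalize("NFKC", ligature))    # 'fi' -- two ordinary letters
```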