The ICU project (which also now has a PHP library) provides the classes needed to normalize UTF-8 strings, which makes it easier to compare values when searching.
Some characters, for example a letter with an accent (say, é), can be represented in two ways: as a single code point (U+00E9) or as the plain letter followed by a combining accent mark (U+0065 U+0301). Ordinary (canonical) normalization always picks one of these to represent it: the single code point for NFC, the combining form for NFD.
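As a minimal sketch using PHP's Normalizer class (from the intl extension, which is backed by ICU): the two spellings of é compare unequal byte-for-byte until they are normalized to the same form.

    <?php
    // Both strings display as "é" but use different code point sequences.
    $precomposed = "\u{00E9}";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    $decomposed  = "e\u{0301}";  // U+0065 followed by U+0301 COMBINING ACUTE ACCENT

    var_dump($precomposed === $decomposed);  // bool(false): the raw bytes differ

    // NFC picks the single precomposed code point...
    var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C) === $precomposed);  // bool(true)

    // ...while NFD picks the base letter plus combining mark.
    var_dump(Normalizer::normalize($precomposed, Normalizer::FORM_D) === $decomposed);  // bool(true)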
For characters that could be represented by several different sequences of base characters and combining marks (say, "s, dot below, dot above" versus putting the dot above first, or starting from a base character that already carries one of the dots), NFD also picks a single ordering (the dot below goes first, as it happens).
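Continuing the same sketch with the intl extension, all three ways of spelling "s with dot below and dot above" end up as one and the same NFD sequence, with the dot below ordered first:

    <?php
    $a = "s\u{0323}\u{0307}";  // s + combining dot below + combining dot above
    $b = "s\u{0307}\u{0323}";  // s + combining dot above + combining dot below
    $c = "\u{1E61}\u{0323}";   // precomposed "s with dot above" + combining dot below

    $nfdA = Normalizer::normalize($a, Normalizer::FORM_D);
    $nfdB = Normalizer::normalize($b, Normalizer::FORM_D);
    $nfdC = Normalizer::normalize($c, Normalizer::FORM_D);

    // All three normalize to s, U+0323, U+0307 (dot below first).
    var_dump($nfdA === $nfdB && $nfdB === $nfdC);  // bool(true)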
The compatibility decompositions cover a number of characters that "shouldn't really" be separate characters but exist because they were used in legacy encodings. Ordinary normalization won't unify these, to preserve round-trip integrity (this isn't an issue for the combining forms, because no legacy encoding [except a handful of Vietnamese encodings] used both), but compatibility normalization will. Think of the "kg" kilogram sign that appears in some East Asian encodings (or the halfwidth/fullwidth katakana and alphabet), or the "fi" ligature in MacRoman.
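To continue the sketch, NFC leaves these compatibility characters alone, while NFKC folds them into their ordinary equivalents:

    <?php
    $kg        = "\u{338F}";  // ㎏ SQUARE KG (the CJK compatibility kilogram sign)
    $fullwidth = "\u{FF21}";  // Ａ FULLWIDTH LATIN CAPITAL LETTER A
    $ligature  = "\u{FB01}";  // ﬁ LATIN SMALL LIGATURE FI

    var_dump(Normalizer::normalize($kg, Normalizer::FORM_C));         // string(3) "㎏" - unchanged
    var_dump(Normalizer::normalize($kg, Normalizer::FORM_KC));        // string(2) "kg"
    var_dump(Normalizer::normalize($fullwidth, Normalizer::FORM_KC)); // string(1) "A"
    var_dump(Normalizer::normalize($ligature, Normalizer::FORM_KC));  // string(2) "fi"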
See http://unicode.org/reports/tr15/ for more details.