What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?

旧时难觅i 2020-12-15 09:24

What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?

They seem to do the same thing, as far as I can tell – although the set of

3 answers
  •  甜味超标
    2020-12-15 09:41

    I’ve posted this question on the Unicode mailing list and got some more responses. I’ll post some of them here.

    Tom Gewecke wrote:

    I'm not an expert on this aspect of Unicode, but I understand that "grapheme extender" is a finer distinction in character properties designed to be used in certain specific and complex processes like grapheme breaking. You might find this blog article helpful in seeing where it comes into play: http://useless-factor.blogspot.com/2007/08/unicode-implementers-guide-part-4.html

    PS The answer by nwellnhof at StackOverflow is an excellent explanation of this issue in my view.

    Philippe Verdy wrote:

    Many grapheme extenders are not "combining characters". Combining characters are classified this way for legacy reasons (the very weak "general category" property), and this property is normatively stabilized. Also, most combining characters have a non-zero combining class, and these classes are stabilized for the purpose of normalization.

    Grapheme extenders include characters that are NOT combining characters but controls (e.g. joiners). Some grapheme clusters are also more complex in some scripts (there are extenders encoded BEFORE the base character, and they cannot be classified as combining characters because combining characters are always encoded AFTER a base character).

    For legacy reasons (and round-trip compatibility with older standards), not all scripts are encoded using the UCS character model of combining characters. One example is the Thai script, which does not follow the "logical" encoding order but the model used in TIS-620 and other standards based on it, including for Windows and *nix/*nux.
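
    (Illustration, not part of the quoted message.) The two properties mentioned above, the general category and the canonical combining class, can be inspected with Python's standard `unicodedata` module. This is only a minimal sketch: the sample characters are my own choice, and the standard library does not expose the Grapheme_Extend property itself, so the code shows only the "combining character" side of the comparison. ZWNJ and U+FF9E are named in this thread as grapheme extenders, yet neither has a mark (M*) general category.

```python
import unicodedata

def describe(ch):
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

def is_mark(ch):
    """True if the general category is Mn, Mc or Me (a 'combining character')."""
    return unicodedata.category(ch).startswith("M")

# Sample characters chosen for illustration only.
samples = ["\u0301",  # COMBINING ACUTE ACCENT
           "\u0903",  # DEVANAGARI SIGN VISARGA
           "\u20dd",  # COMBINING ENCLOSING CIRCLE
           "\u200c",  # ZERO WIDTH NON-JOINER
           "\uff9e"]  # HALFWIDTH KATAKANA VOICED SOUND MARK

for ch in samples:
    cat, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category={cat}, ccc={ccc}, mark={is_mark(ch)}")
```

    Note that a mark can have combining class 0 (the spacing and enclosing marks here do), so a non-zero combining class is not a reliable test for "combining character" on its own.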

    Richard Wordingham wrote:

    Spacing combining marks (category Mc) are in general not grapheme extenders. The ones that are included are mostly included so that the boundaries between 'legacy grapheme clusters' (http://www.unicode.org/reports/tr29/tr29-23.html) are invariant under canonical equivalence. There are six grapheme extenders that are not nonspacing (Mn) or enclosing (Me) and are not needed by this rule: ZWNJ, ZWJ, U+302E HANGUL SINGLE DOT TONE MARK, U+302F HANGUL DOUBLE DOT TONE MARK, U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK, and U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK.

    I can see that it will sometimes be helpful to keep ZWNJ and ZWJ along with the previous base character. The fullwidth sound marks U+3099 and U+309A are included for reasons of canonical equivalence, so it makes sense to include their halfwidth versions.

    I don't actually see the logic for including U+302E and U+302F. If you're going to encourage forcing someone who's typed the wrong base character before a sequence of 3 non-spacing marks to retype the lot, you may as well do the same with Hangul tone marks.
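
    (Illustration, not part of the quoted message.) The canonical-equivalence argument can be made concrete with `unicodedata.normalize`. The sketch below uses sample characters of my own choosing: a precomposed Tamil vowel sign whose canonical decomposition ends in a spacing mark (Mc), a kana whose decomposition ends in U+3099, and the halfwidth voiced sound mark, which is only compatibility-equivalent to U+3099. If the trailing piece of a canonical decomposition did not extend the cluster, the NFC and NFD forms of the same text would break at different places.

```python
import unicodedata

def decompose(text, form="NFD"):
    """Return [(code point, general category), ...] for the normalized text."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c))
            for c in unicodedata.normalize(form, text)]

# TAMIL VOWEL SIGN O: canonical decomposition into two spacing marks.
print(decompose("\u0bca"))

# HIRAGANA LETTER GA: base letter followed by the combining voiced sound mark.
print(decompose("\u304c"))

# HALFWIDTH KATAKANA VOICED SOUND MARK: no canonical decomposition,
# but its compatibility decomposition (NFKD) is U+3099.
print(decompose("\uff9e"))
print(decompose("\uff9e", "NFKD"))
```

    The last two calls show why the halfwidth marks are not covered by the canonical-equivalence rule itself: NFD leaves U+FF9E unchanged, and only the compatibility form maps it to U+3099.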
