Algorithm to check for combining characters in Unicode

耶瑟儿~ 2020-12-18 02:45

I intend to normalize to Form C, then divide into "display units", basically a glyph plus all following combining characters. For now, I'm just looking to handle the Lati

3 answers
  •  渐次进展
    2020-12-18 03:34

    These are all the ranges of Unicode code points (in hexadecimal) whose names contain the word 'combining' (e.g. U+0301 COMBINING ACUTE ACCENT):

    300-36F
    483-489
    7EB-7F3
    135F-135F
    1A7F-1A7F
    1B6B-1B73
    1DC0-1DE6
    1DFD-1DFF
    20D0-20F0
    2CEF-2CF1
    2DE0-2DFF
    3099-309A
    A66F-A672
    A67C-A67D
    A6F0-A6F1
    A8E0-A8F1
    FE20-FE26
    101FD-101FD
    1D165-1D169
    1D16D-1D172
    1D17B-1D182
    1D185-1D18B
    1D1AA-1D1AD
    1D242-1D244

    I compiled this list with a Python script that uses the unicodedata module. I don't know exactly which version of Unicode it reflects, but I think it's reasonably up to date.
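The original script is not shown, so the following is only a minimal sketch of how such a list could be compiled. The function name `combining_name_ranges` is made up for illustration. It scans every code point, keeps those whose character name contains "COMBINING", and merges consecutive hits into inclusive ranges. The output depends on the Unicode version bundled with your Python build, which `unicodedata.unidata_version` reports, so it may differ from the list above.

```python
import sys
import unicodedata


def combining_name_ranges():
    """Return inclusive (start, end) ranges of code points whose
    Unicode name contains the word 'COMBINING'.

    Hypothetical helper for illustration, not the answerer's script.
    """
    ranges = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (unassigned, surrogates, controls) yield "".
        if "COMBINING" not in unicodedata.name(chr(cp), ""):
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return [(start, end) for start, end in ranges]


if __name__ == "__main__":
    print("Unicode version:", unicodedata.unidata_version)
    for start, end in combining_name_ranges():
        print(f"{start:X}-{end:X}")
```

Matching on the character name is a heuristic. It finds characters that are called "combining", which is not necessarily the same set as characters that behave as combining marks.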

    However, I don't know whether characters that are 'combining' in the strict sense are all you need, as Unicode also has 'modifier letters' and the like.
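For the splitting step described in the question, one possible approach (an assumption, not something taken from this thread) is to test each character's general category instead of its name. The sketch below normalizes to Form C, then attaches every character whose category starts with "M" (Mn, Mc, Me) to the preceding base character. The name `display_units` is hypothetical. This is a simplification that may be adequate for Latin text; it does not implement the full grapheme cluster rules of Unicode Standard Annex #29.

```python
import unicodedata


def display_units(text):
    """Split text into units of one base character plus any
    following combining marks, after normalizing to NFC.

    Hypothetical helper; a mark with no preceding base character
    becomes a unit of its own.
    """
    units = []
    for ch in unicodedata.normalize("NFC", text):
        # General categories Mn, Mc and Me are the mark categories.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


if __name__ == "__main__":
    # 'e' + U+0301 composes to a single code point under NFC,
    # while 'q' + U+0301 has no precomposed form and stays as two.
    print(display_units("e\u0301q\u0301x"))
```

Modifier letters (category Lm) are treated as base characters here, so they start a new unit rather than attaching to the previous one.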
