Algorithm to check for combining characters in Unicode

耶瑟儿~ 2020-12-18 02:45

I intend to normalize to Form C, then divide into "display units", basically a glyph plus all following combining characters. For now, I'm just looking to handle the Lati

3 answers
  •  庸人自扰
    2020-12-18 03:32

    @lenz's answer covers most of the codepoints, but some are missing. Below is a list of ranges found by processing the Names List file. Some codepoints have COMBINING in their name but are not combining characters, for example the Combining Grapheme Joiner (CGJ, 0x34f) [wiki]. As the Wikipedia article puts it:

    Its name is a misnomer and does not describe its function; the character does not join graphemes. Its purpose is to separate characters that should not be considered digraphs.
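
    As a rough illustration of why matching on the name alone is not enough, the sketch below uses Python's standard `unicodedata` module to compare CGJ with an ordinary combining accent. The helper name `describe` is made up for this example, and the canonical combining class is only one possible criterion, not necessarily the one used to build the list in this answer:

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical combining class) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

cgj = "\u034f"    # COMBINING GRAPHEME JOINER
acute = "\u0301"  # COMBINING ACUTE ACCENT

for ch in (cgj, acute):
    name, category, ccc = describe(ch)
    # Both names contain "COMBINING", but only the accent has a non-zero combining class.
    print(f"U+{ord(ch):04X} {name}: category={category}, combining class={ccc}")
```

    Both names contain COMBINING, yet `unicodedata.combining` returns 0 for CGJ and a non-zero class for the accent, so a filter on the name would wrongly include CGJ.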

    Processing the list yields the following ranges and single characters. Entries that differ, even slightly, from lenz's list are marked with an exclamation mark (!). Often the difference is small: for example, one character inside a range is not a combining character, so the range is "cut in two":

      0x300 -   0x34e  !
      0x350 -   0x36f  !
      0x483 -   0x487  !
      0x591 -   0x5bd  !
      0x5bf            !
      0x5c1 -   0x5c2  !
      0x5c4 -   0x5c5  !
      0x5c7            !
      0x610 -   0x61a  !
      0x64b -   0x65f  !
      0x670            !
      0x6d6 -   0x6dc  !
      0x6df -   0x6e4  !
      0x6e7 -   0x6e8  !
      0x6ea -   0x6ed  !
      0x711            !
      0x730 -   0x74a  !
      0x7eb -   0x7f3
      0x816 -   0x819  !
      0x81b -   0x823  !
      0x825 -   0x827  !
      0x829 -   0x82d  !
      0x859 -   0x85b  !
      0x8d4 -   0x8e1  !
      0x8e3 -   0x8ff  !
      0x93c            !
      0x94d            !
      0x951 -   0x954  !
      0x9bc            !
      0x9cd            !
      0xa3c            !
      0xa4d            !
      0xabc            !
      0xacd            !
      0xb3c            !
      0xb4d            !
      0xbcd            !
      0xc4d            !
      0xc55 -   0xc56  !
      0xcbc            !
      0xccd            !
      0xd4d            !
      0xdca            !
      0xe38 -   0xe3a  !
      0xe48 -   0xe4b  !
      0xeb8 -   0xeb9  !
      0xec8 -   0xecb  !
      0xf18 -   0xf19  !
      0xf35            !
      0xf37            !
      0xf39            !
      0xf71 -   0xf72  !
      0xf74            !
      0xf7a -   0xf7d  !
      0xf80            !
      0xf82 -   0xf84  !
      0xf86 -   0xf87  !
      0xfc6            !
     0x1037            !
     0x1039 -  0x103a  !
     0x108d            !
     0x135d -  0x135f  !
     0x1714            !
     0x1734            !
     0x17d2            !
     0x17dd            !
     0x18a9            !
     0x1939 -  0x193b  !
     0x1a17 -  0x1a18  !
     0x1a60            !
     0x1a75 -  0x1a7c  !
     0x1a7f
     0x1ab0 -  0x1abd  !
     0x1b34            !
     0x1b44            !
     0x1b6b -  0x1b73
     0x1baa -  0x1bab  !
     0x1be6            !
     0x1bf2 -  0x1bf3  !
     0x1c37            !
     0x1cd0 -  0x1cd2  !
     0x1cd4 -  0x1ce0  !
     0x1ce2 -  0x1ce8  !
     0x1ced            !
     0x1cf4            !
     0x1cf8 -  0x1cf9  !
     0x1dc0 -  0x1df5  !
     0x1dfb -  0x1dff  !
     0x20d0 -  0x20dc  !
     0x20e1            !
     0x20e5 -  0x20f0  !
     0x2cef -  0x2cf1
     0x2d7f            !
     0x2de0 -  0x2dff
     0x302a -  0x302f  !
     0x3099 -  0x309a
     0xa66f            !
     0xa674 -  0xa67d  !
     0xa69e -  0xa69f  !
     0xa6f0 -  0xa6f1
     0xa806            !
     0xa8c4            !
     0xa8e0 -  0xa8f1
     0xa92b -  0xa92d  !
     0xa953            !
     0xa9b3            !
     0xa9c0            !
     0xaab0            !
     0xaab2 -  0xaab4  !
     0xaab7 -  0xaab8  !
     0xaabe -  0xaabf  !
     0xaac1            !
     0xaaf6            !
     0xabed            !
     0xfb1e            !
     0xfe20 -  0xfe2f  !
    0x101fd
    0x102e0            !
    0x10376 - 0x1037a  !
    0x10a0d            !
    0x10a0f            !
    0x10a38 - 0x10a3a  !
    0x10a3f            !
    0x10ae5 - 0x10ae6  !
    0x11046            !
    0x1107f            !
    0x110b9 - 0x110ba  !
    0x11100 - 0x11102  !
    0x11133 - 0x11134  !
    0x11173            !
    0x111c0            !
    0x111ca            !
    0x11235 - 0x11236  !
    0x112e9 - 0x112ea  !
    0x1133c            !
    0x1134d            !
    0x11366 - 0x1136c  !
    0x11370 - 0x11374  !
    0x11442            !
    0x11446            !
    0x114c2 - 0x114c3  !
    0x115bf - 0x115c0  !
    0x1163f            !
    0x116b6 - 0x116b7  !
    0x1172b            !
    0x11c3f            !
    0x16af0 - 0x16af4  !
    0x16b30 - 0x16b36  !
    0x1bc9e            !
    0x1d165 - 0x1d169
    0x1d16d - 0x1d172
    0x1d17b - 0x1d182
    0x1d185 - 0x1d18b
    0x1d1aa - 0x1d1ad
    0x1d242 - 0x1d244
    0x1e000 - 0x1e006  !
    0x1e008 - 0x1e018  !
    0x1e01b - 0x1e021  !
    0x1e023 - 0x1e024  !
    0x1e026 - 0x1e02a  !
    0x1e8d0 - 0x1e8d6  !
    0x1e944 - 0x1e94a  !
    

    This results in a total of 814 codepoints.
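
    A table like this can be used for the splitting described in the question. The following is a minimal sketch, not a complete implementation: it copies only a handful of the ranges above (the ones most likely to follow Latin letters), and the names `COMBINING_RANGES`, `is_combining` and `display_units` are invented for the example. The remaining ranges would be added to the table in the same way.

```python
import bisect
import unicodedata

# Subset of the ranges listed above (inclusive bounds), sorted by start.
# Assumption: this subset is enough for Latin text; the full table is longer.
COMBINING_RANGES = [
    (0x0300, 0x034E),
    (0x0350, 0x036F),
    (0x1AB0, 0x1ABD),
    (0x1DC0, 0x1DF5),
    (0x1DFB, 0x1DFF),
    (0x20D0, 0x20DC),
    (0x20E1, 0x20E1),
    (0x20E5, 0x20F0),
    (0xFE20, 0xFE2F),
]
_STARTS = [lo for lo, _ in COMBINING_RANGES]

def is_combining(ch):
    """Binary-search the range table for the codepoint of ch."""
    cp = ord(ch)
    i = bisect.bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= COMBINING_RANGES[i][1]

def display_units(text):
    """Normalize to NFC, then attach each combining character to the unit before it."""
    units = []
    for ch in unicodedata.normalize("NFC", text):
        if units and is_combining(ch):
            units[-1] += ch
        else:
            # A base character, or a combining character with nothing to attach to.
            units.append(ch)
    return units
```

    For example, `display_units("e\u0301q\u0303!")` gives three units: the precomposed é produced by NFC, `q` followed by its combining tilde (there is no precomposed form), and `!`. Note that `is_combining("\u034f")` is false, because CGJ falls in the gap between the first two ranges.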
