What are the upper and lower bounds for Chinese characters in UTF-8?

别等时光非礼了梦想. Submitted on 2019-12-29 06:29:47

Question


I would like to make a set in Python that contains the ord() values of all Chinese characters.

For English, the equivalent is:

english = (set(range(ord('a'), ord('z') + 1)) |
           set(range(ord('A'), ord('Z') + 1)))

Answer 1:


From the Unicode Standard (v6.0, section 12.1),

Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2.

Table 12-2. Blocks Containing Han Ideographs

Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants

And there are a couple of extras, outside of these blocks:

Table 12-3. Small Extensions to the URO

Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard

To use set operations to construct a set of the ordinal values of these, you can do this:

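# set().union() accepts several iterables, so this runs on Python 2 and 3
# (range objects cannot be joined with + on Python 3).
chinese = set().union(
    range(0x4E00, 0xA000),    # CJK Unified Ideographs
    range(0x3400, 0x4DC0),    # Extension A
    range(0x20000, 0x2A6E0),  # Extension B
    range(0x2A700, 0x2B740),  # Extension C
    range(0x2B740, 0x2B820),  # Extension D
    range(0xF900, 0xFB00),    # CJK Compatibility Ideographs
    range(0x2F800, 0x2FA20),  # CJK Compatibility Ideographs Supplement
    range(0x9FA6, 0x9FCC))    # Table 12-3 (already inside the first range)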

Be aware, though, that this set contains over 75,000 code points, so it may not be the most compact or efficient data structure for this.
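As a more compact alternative to a set of that size, the same membership test can be done against the block boundaries themselves. The following is only a sketch, assuming Python 3; the names HAN_RANGES, is_han and count_han are made up for illustration, and the ranges are copied from Table 12-2 (the Table 12-3 extras already fall inside the first range).

```python
# Inclusive (first, last) code point pairs copied from Table 12-2.
HAN_RANGES = (
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
)


def is_han(ch):
    """Return True if the single character ch lies in one of HAN_RANGES."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in HAN_RANGES)


def count_han(text):
    """Count the characters of text that lie in one of HAN_RANGES."""
    return sum(1 for ch in text if is_han(ch))
```

This stores seven pairs instead of tens of thousands of integers, at the cost of a short linear scan per character. Whether that trade-off is worthwhile depends on how often the test runs.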

Also, if you insist on using ord() on literal characters, note that code points above U+FFFF need the long escape form, which is \U followed by exactly eight hex digits:

>>> ord(u'\U0002F800')
194560
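Note that the ranges above are Unicode code points, not UTF-8 byte values, although the question title mentions UTF-8. A minimal sketch, assuming Python 3, of how a code point relates to its UTF-8 encoding (utf8_bytes is a hypothetical helper name, not part of the original answer):

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of the single character ch as bytes."""
    return ch.encode('utf-8')


# One ideograph from the Basic Multilingual Plane and one from a
# supplementary plane (CJK Compatibility Ideographs Supplement).
for ch in ('\u4e00', '\U0002F800'):
    encoded = utf8_bytes(ch)
    print('U+%04X' % ord(ch), len(encoded), 'bytes:', encoded.hex())
```

Ideographs in the 4E00–9FFF block take three bytes in UTF-8, while those in the supplementary blocks (20000 and above) take four, so comparing ord() values is usually simpler than comparing encoded bytes.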


Source: https://stackoverflow.com/questions/9166130/what-are-the-upper-and-lower-bound-for-chinese-char-in-utf-8
