How to find Chinese or Japanese characters in a string in Python?

北海茫月 2021-01-31 05:16

Such as:

str = 'sdf344asfasf天地方益3権sdfsdf'

Wrap the Chinese and Japanese characters in parentheses:

strAfterConvert = 'sdfasf


        
4 Answers
  •  天命终不由人
    2021-01-31 06:02

    As a start, you can check whether the character falls in one of the following Unicode blocks:

    • Unicode Block 'CJK Unified Ideographs' - U+4E00 to U+9FFF
    • Unicode Block 'CJK Unified Ideographs Extension A' - U+3400 to U+4DBF
    • Unicode Block 'CJK Unified Ideographs Extension B' - U+20000 to U+2A6DF
    • Unicode Block 'CJK Unified Ideographs Extension C' - U+2A700 to U+2B73F
    • Unicode Block 'CJK Unified Ideographs Extension D' - U+2B740 to U+2B81F
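
A quick way to sanity-check the block boundaries listed above is to ask Python's own Unicode database for a character's name. This is only a cross-check, not the method used in this answer: the helper name `looks_like_cjk_ideograph` is made up for illustration, and the check deliberately ignores kana and the compatibility blocks.

```python
import unicodedata

def looks_like_cjk_ideograph(char):
    """True if the Unicode name of char marks it as a CJK unified ideograph."""
    # unicodedata.name() raises ValueError for unnamed code points unless a
    # default is given, so fall back to an empty string.
    return unicodedata.name(char, "").startswith("CJK UNIFIED IDEOGRAPH")

for ch in "天権aあ":
    print(ch, hex(ord(ch)), looks_like_cjk_ideograph(ch))
```

Hiragana such as あ is reported as not an ideograph here, which is one reason a range table is more flexible than a name check.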

    With those ranges in hand, iterate through the string, check whether each character is Chinese, Japanese or Korean (CJK), and wrap each run of CJK characters in parentheses:

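    # -*- coding:utf-8 -*-
    # Python 3. Each entry is an inclusive range of code points.
    ranges = [
      {"from": ord(u"\u3300"), "to": ord(u"\u33ff")},         # CJK compatibility
      {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")},         # CJK compatibility forms
      {"from": ord(u"\uf900"), "to": ord(u"\ufaff")},         # CJK compatibility ideographs
      {"from": ord(u"\U0002F800"), "to": ord(u"\U0002fa1f")}, # CJK compatibility ideographs supplement
      {"from": ord(u"\u3040"), "to": ord(u"\u309f")},         # Japanese Hiragana
      {"from": ord(u"\u30a0"), "to": ord(u"\u30ff")},         # Japanese Katakana
      {"from": ord(u"\u2e80"), "to": ord(u"\u2eff")},         # CJK radicals supplement
      {"from": ord(u"\u4e00"), "to": ord(u"\u9fff")},         # CJK unified ideographs
      {"from": ord(u"\u3400"), "to": ord(u"\u4dbf")},         # extension A
      {"from": ord(u"\U00020000"), "to": ord(u"\U0002a6df")}, # extension B
      {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")}, # extension C
      {"from": ord(u"\U0002b740"), "to": ord(u"\U0002b81f")}, # extension D
      {"from": ord(u"\U0002b820"), "to": ord(u"\U0002ceaf")}  # extension E, included as of Unicode 8.0
    ]
    
    def is_cjk(char):
      return any(r["from"] <= ord(char) <= r["to"] for r in ranges)
    
    def cjk_substrings(string):
      # Yield (start, end) slice bounds for each maximal run of CJK characters.
      i = 0
      while i < len(string):
        if is_cjk(string[i]):
          start = i
          while i < len(string) and is_cjk(string[i]):
            i += 1
          yield start, i
        else:
          i += 1
    
    def wrap_cjk(string):
      # Rebuild the string, putting parentheses around every CJK run.
      out, pos = [], 0
      for start, end in cjk_substrings(string):
        out.append(string[pos:start] + "(" + string[start:end] + ")")
        pos = end
      return "".join(out) + string[pos:]
    
    print(wrap_cjk("sdf344asfasf天地方益3権sdfsdf"))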

    The above prints

    sdf344asfasf(天地方益)3(権)sdfsdf
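
The same wrapping can also be sketched with a regular expression instead of a hand-written loop. This is an alternative formulation, not the code from this answer: the names `CJK_RUN` and `wrap_cjk_regex` are invented for the example, and it assumes Python 3, where `\U` escapes in a `str` pattern can address code points beyond the Basic Multilingual Plane.

```python
import re

# One character class covering the same blocks as the range table,
# followed by "+" so that each match is a maximal run of CJK characters.
CJK_RUN = re.compile(
    "["
    "\u2e80-\u2eff"          # CJK Radicals Supplement
    "\u3040-\u309f"          # Hiragana
    "\u30a0-\u30ff"          # Katakana
    "\u3300-\u33ff"          # CJK Compatibility
    "\u3400-\u4dbf"          # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"          # CJK Unified Ideographs
    "\uf900-\ufaff"          # CJK Compatibility Ideographs
    "\ufe30-\ufe4f"          # CJK Compatibility Forms
    "\U00020000-\U0002a6df"  # Extension B
    "\U0002a700-\U0002b73f"  # Extension C
    "\U0002b740-\U0002b81f"  # Extension D
    "\U0002b820-\U0002ceaf"  # Extension E
    "\U0002f800-\U0002fa1f"  # CJK Compatibility Ideographs Supplement
    "]+"
)

def wrap_cjk_regex(text):
    """Wrap each maximal run of CJK characters in parentheses."""
    return CJK_RUN.sub(lambda m: "(" + m.group(0) + ")", text)

print(wrap_cjk_regex("sdf344asfasf天地方益3権sdfsdf"))
```

Because the substitution works on match positions rather than on substring values, a run that appears twice in the input is wrapped once at each position.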
    

    To be future-proof, you might want to keep a lookout for CJK Unified Ideographs Extension E. It will ship with Unicode 8.0, which is scheduled for release in June 2015. I've added it to the ranges, but you shouldn't include it until Unicode 8.0 is released.
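
One way to make that decision at runtime, rather than by editing the table by hand, is to look at the version of the Unicode database bundled with the interpreter. The sketch below is an assumption about how one might do it, not part of the original answer: `unicodedata.unidata_version` is a real standard-library attribute, while `unicode_version`, `supports_extension_e` and `extra_ranges` are names made up here.

```python
import unicodedata

EXTENSION_E = (0x2B820, 0x2CEAF)  # CJK Unified Ideographs Extension E

def unicode_version():
    """Return the interpreter's Unicode database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))

def supports_extension_e():
    """Extension E was added in Unicode 8.0."""
    return unicode_version() >= (8, 0)

# Only offer the Extension E range when the bundled database knows about it.
extra_ranges = [EXTENSION_E] if supports_extension_e() else []
print(unicodedata.unidata_version, extra_ranges)
```

The range comparison itself works on any interpreter. The version check only reflects whether the bundled character database assigns those code points.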

    [EDIT]

    Added CJK compatibility ideographs, Japanese Kana and CJK radicals.
