What are all the Japanese whitespace characters?

温柔的废话 2020-12-20 13:26

I need to split a string and extract the words separated by whitespace characters. The source may be in English or Japanese. English whitespace characters include tab and space,

2 Answers
  • 2020-12-20 14:08

    I just found your posting. This is a great explanation of normalizing Unicode characters.

    http://en.wikipedia.org/wiki/Unicode_equivalence

    I found that many programming languages, such as Python, have modules that implement the normalization rules from the Unicode standard. For my purposes, the following Python code works very well. It converts Unicode compatibility variants of whitespace, such as the full-width space, to the ASCII range. After the normalization, a regex can collapse every run of whitespace into a single ASCII space (\x20, decimal 32):

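    import unicodedata
    # import re

    # Python 3: str literals are already Unicode, so no u'' prefix is needed
    ucode = '大変、 よろしくお願い申し上げます。'

    # NFKC folds compatibility characters, e.g. U+3000 becomes U+0020
    normalized = unicodedata.normalize('NFKC', ucode)

    # old code
    # utf8text = re.sub(r'\s+', ' ', normalized).encode('utf-8')

    # new code: split on whitespace runs, rejoin with single spaces, then encode
    utf8text = ' '.join(normalized.split()).encode('utf-8')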
    

    Since first writing this, I learned that Python's regex (re) module improperly identifies these whitespace characters and can cause a crash if they are encountered. It turns out that a faster, more reliable method is to use the .split() function.
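
    To see which characters the NFKC step actually folds, here is a minimal Python 3 sketch. The helper name folds_to_space and the sample list are made up for illustration; only unicodedata.normalize comes from the standard library.

```python
import unicodedata

def folds_to_space(ch):
    """Return True if NFKC normalization maps ch to a plain ASCII space."""
    return unicodedata.normalize('NFKC', ch) == ' '

# Hypothetical sample of space-like characters to probe.
candidates = {
    'NO-BREAK SPACE': '\u00a0',
    'IDEOGRAPHIC SPACE': '\u3000',
    'EM SPACE': '\u2003',
    'TAB': '\t',
}

results = {name: folds_to_space(ch) for name, ch in candidates.items()}
for name, folded in results.items():
    print(name, folded)
```

    Tab is left alone by NFKC, which is why the .split() (or regex) step is still needed after normalizing.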

  • 2020-12-20 14:09

    You need the ASCII tab, space and non-breaking space (U+00A0), and the full-width space, which you've correctly identified as U+3000. You might also want newlines and vertical space characters. If your input is in Unicode (not Shift-JIS, etc.) then that's all you'll need. There are other (control) characters, such as \0 NULL, which are sometimes used as information delimiters, but they won't be rendered as a space in East Asian text, i.e., they won't appear as whitespace.
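
    A minimal sketch of splitting on exactly the characters listed above, assuming Python 3 and Unicode (str) input. The names JP_EN_SPACES and split_words are hypothetical; add \n, \r, \v and \f to the character class if newlines and vertical space should also separate words.

```python
import re

# Tab, ASCII space, no-break space (U+00A0), full-width/ideographic space (U+3000).
JP_EN_SPACES = re.compile('[\t \u00a0\u3000]+')

def split_words(text):
    """Split text on runs of the four characters above, dropping empty pieces."""
    return [word for word in JP_EN_SPACES.split(text) if word]

words = split_words('hello\tworld\u3000こんにちは\u00a0世界 end')
print(words)
```

    Spelling the characters out keeps the behaviour independent of how a given regex engine defines \s.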

    edit: Matt Ball has a good point in his comment, but, as his example illustrates, many regex implementations don't deal well with full-width East Asian punctuation. In this connection, it's worth mentioning that Python's string.whitespace won't cut the mustard either.
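
    A quick check of that last point, assuming Python 3: string.whitespace lists only ASCII whitespace, whereas str.isspace() and str.split() follow the Unicode definition. The variable names are just for illustration.

```python
import string

full_width_space = '\u3000'

# Membership in the ASCII-only table of whitespace characters.
in_ascii_table = full_width_space in string.whitespace
# Unicode-aware check on the same character.
is_unicode_space = full_width_space.isspace()
# Unicode-aware split with no explicit separator.
pieces = 'お願い\u3000します'.split()

print(in_ascii_table, is_unicode_space, pieces)
```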
