What are all the Japanese whitespace characters?


温柔的废话 2020-12-20 13:26

I need to split a string and extract words separated by whitespace characters. The source may be in English or Japanese. English whitespace characters include tab and space,

2 answers

半阙折子戏 (original poster)

2020-12-20 14:08
I just found your posting. This is a great explanation of normalizing Unicode characters.

http://en.wikipedia.org/wiki/Unicode_equivalence

I found that many programming languages, such as Python, have modules that implement the normalization rules from the Unicode standard. For my purposes, the following Python code works very well. It converts the Unicode compatibility variants of whitespace into the ASCII range. After normalization, a regex can collapse every run of whitespace into a single ASCII space (\x20, decimal 32):
```
import unicodedata
# import re

ucode = u'大変、 よろしくお願い申し上げます。'

normalized = unicodedata.normalize('NFKC', ucode)

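# old code (regex-based; needs the re import above)
# utf8text = re.sub(r'\s+', ' ', normalized).encode('utf-8')

# new code: split on any run of whitespace, rejoin with single ASCII spaces.
# Splitting the str before encoding keeps this working on Python 3,
# where encoded bytes cannot be joined with a str separator.
utf8text = ' '.join(normalized.split()).encode('utf-8')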
```
Since first writing this, I learned that Python's regex (re) module can misidentify these whitespace characters and may crash when it encounters them. A faster, more reliable method turns out to be the .split() function.
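
For anyone on Python 3, here is a minimal sketch of the same idea, plus a way to list the whitespace characters Python itself recognizes. The helper names `split_words` and `unicode_whitespace` are my own, chosen for illustration. The sample string assumes the Japanese text uses the ideographic space U+3000, which NFKC maps to an ordinary ASCII space.

```python
import sys
import unicodedata


def split_words(text):
    """Normalize with NFKC, then split on runs of whitespace."""
    normalized = unicodedata.normalize('NFKC', text)
    # str.split() with no arguments splits on any Unicode whitespace
    # and drops empty strings from the result.
    return normalized.split()


def unicode_whitespace():
    """Return every code point that str.isspace() treats as whitespace."""
    return [cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()]


# U+3000 is the ideographic (full-width) space; \t is an ASCII tab.
sample = 'Hello\tworld\u3000こんにちは\u3000世界'
print(split_words(sample))  # ['Hello', 'world', 'こんにちは', '世界']

# Print each whitespace code point with its Unicode name, where one exists.
for cp in unicode_whitespace():
    print('U+%04X' % cp, unicodedata.name(chr(cp), '(unnamed control)'))
```

The list printed at the end depends on the Unicode database shipped with your Python version, so treat it as a starting point rather than a fixed answer.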