What are all the Japanese whitespace characters?

后端未结

关注

 2  1715

温柔的废话 2020-12-20 13:26

I need to split a string and extract words separated by whitespace characters.The source may be in English or Japanese. English whitespace characters include tab and space,

2条回答

無奈伤痛 (楼主)

2020-12-20 14:09

You need the ASCII tab, space and non-breaking space (U+00A0), and the full-width space, which you've correctly identified as U+3000. You might possibly want newlines and vertical space characters. If your input is in unicode (not Shift-JIS, etc.) then that's all you'll need. There are other (control) characters such as \0 NULL which are sometimes used as information delimiters, but they won't be rendered as a space in East Asian text - i.e., they won't appear as white-space.

edit: Matt Ball has a good point in his comment, but, as his example illustrates, many regex implementations don't deal well with full-width East Asian punctuation. In this connection, it's worth mentioning that Python's string.whitespace won't cut the mustard either.

0 讨论(0)

查看其它2个回答
发布评论:

提交评论
- 加载中...