Character classification

Submitted by 南笙酒味 on 2019-12-23 08:53:26

Question


A simple question again: given a std::string, how do I determine which of its characters are digits, symbols, whitespace, etc., with respect to the user's language and regional settings (locale)?

I managed to split the string into a set of characters using the Boost.Locale boundary analysis tool:

std::string text = u8"生きるか死ぬか";

boost::locale::boundary::segment_index<std::string::const_iterator> characters(
    boost::locale::boundary::character,
    text.begin(), text.end(),
    boost::locale::generator()("ja_JP.UTF-8"));

for (const auto& ch : characters) {
    // each 'ch' is a segment covering a single character of the Japanese text
}

However, I do not see any way to determine whether ch is a digit, a symbol, or anything else. There are Boost string classification algorithms, but they don't seem to work with whatever a dereferenced segment_index::iterator is.

Nor can I apply std::isalpha(std::locale), because I'm unsure whether it is possible to convert the Boost segment into a char or wchar_t.

Is there any neat way to classify symbols?


Answer 1:


There are a number of functions and objects supporting this in <locale>, but the example text you give looks like UTF-8, which is a multibyte encoding. The classification functions in <locale> test a single char or wchar_t at a time, so they don't work with multibyte encodings.
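To make the limitation concrete, here is a minimal sketch that runs std::isalpha from <locale> over a string one char at a time. The helper name count_alpha_bytes is hypothetical, and the default argument assumes the classic "C" locale. A UTF-8 encoded character such as 生 occupies three bytes, and each byte is tested in isolation, so none of them is recognised as a letter.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: counts the bytes of `text` that the locale's
// ctype<char> facet classifies as alphabetic. Every byte is tested on
// its own, which is all the char overload of std::isalpha can do.
std::size_t count_alpha_bytes(const std::string& text,
                              const std::locale& loc = std::locale::classic()) {
    std::size_t count = 0;
    for (char byte : text) {
        if (std::isalpha(byte, loc)) {
            ++count;
        }
    }
    return count;
}
```

Calling count_alpha_bytes("abc1") counts the three ASCII letters, while the three bytes "\xE7\x94\x9F" that encode 生 are each rejected in the classic locale.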

I'd suggest you get the ICU library and use it. Among other things, it allows testing for all of the properties defined in the Unicode Character Database. It also has macros or functions for iterating over a string (or at least an array of char), extracting one UTF-32 code point at a time (which is what you'd want to test).
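As an illustration of what "extracting one UTF-32 code point at a time" means, the following is a simplified, standard-library-only sketch. It is not ICU's implementation, and the function name decode_utf8 is hypothetical. It turns malformed or truncated sequences into U+FFFD but does not reject overlong forms or surrogates, which a real decoder should. In practice ICU's own iteration macros (for example U8_NEXT in unicode/utf8.h) would do this step, and each resulting code point could then be passed to ICU property functions such as u_isdigit or u_isalpha from unicode/uchar.h.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Simplified sketch: overlong forms and surrogates are not rejected;
// invalid lead bytes and broken sequences yield U+FFFD.
std::vector<char32_t> decode_utf8(const std::string& text) {
    std::vector<char32_t> code_points;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        char32_t cp = 0;
        std::size_t trailing = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        trailing = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; }
        else { code_points.push_back(0xFFFD); ++i; continue; }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < trailing; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= text.size() ||
                (static_cast<unsigned char>(text[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(text[i]) & 0x3F);
            ++i;
        }
        code_points.push_back(ok ? cp : char32_t{0xFFFD});
    }
    return code_points;
}
```

For the first two characters of the question's string, 生き, the decoder produces U+751F and U+304D. These whole code points, not the individual bytes, are the values to classify.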



Source: https://stackoverflow.com/questions/24485042/character-classification
