Character classification

Question

The simple question again: given a std::string, determine which of its characters are digits, symbols, whitespace, etc. with respect to the user's language and regional settings (locale).

I managed to split the string into individual characters using the Boost.Locale boundary analysis tool:

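#include <boost/locale.hpp>
#include <string>

// Before C++20 a u8 literal is an array of char, so it can initialize a std::string.
std::string text = u8"生きるか死ぬか";

boost::locale::boundary::segment_index<std::string::const_iterator> characters(
    boost::locale::boundary::character,
    text.begin(), text.end(),
    boost::locale::generator()("ja_JP.UTF-8"));

for (const auto& ch : characters) {
    // each 'ch' is a boundary segment (an iterator range into 'text')
    // covering a single Japanese character, which may span several bytes
}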

However, I do not see any way to then determine whether ch is a digit, a symbol, or anything else. Boost has string classification algorithms, but they do not seem to work with whatever segment_index::iterator dereferences to.

Nor can I apply std::isalpha with a std::locale, because I am not sure whether a Boost segment can be converted to a char or wchar_t.

Is there any neat way to classify symbols?

Answer 1:

There are a number of functions and objects supporting this in <locale>, but the example text you give looks like UTF-8, which is a multibyte encoding. The functions in <locale> classify a single char or wchar_t at a time, so they don't work with multibyte encodings.
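To make the multibyte problem concrete, here is a minimal standard-library sketch. The helper names count_code_points and is_all_ascii are made up for illustration, and the code assumes the input is well-formed UTF-8. It shows that the number of char elements and the number of characters diverge as soon as the text leaves ASCII, which is why a per-char classifier has nothing meaningful to look at.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (bytes of the form 10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return n;
}

// Hypothetical helper: true only if every byte is plain ASCII, the one case
// in which each char is guaranteed to be a whole character.
bool is_all_ascii(const std::string& s) {
    for (unsigned char byte : s) {
        if (byte >= 0x80) {
            return false;
        }
    }
    return true;
}
```

For the string in the question, text.size() reports bytes rather than characters, while count_code_points(text) counts the characters themselves. Passing any single byte of such a character to std::isalpha tests a fragment, not the character.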

I'd suggest you get the ICU library and use it. Amongst other things, it allows testing for all of the properties defined in the Unicode Character Database. It also has macros and functions for iterating over a string (or at least an array of char), extracting one UTF-32 code point at a time (which is what you'd want to test).

Source: https://stackoverflow.com/questions/24485042/character-classification

Tags

c++

string

boost

locale