How can I match UTF-8 Unicode characters using boost::spirit?
For example, I want to recognize all characters in this string:
$ echo \"На
You can't. The problem is not boost::spirit but that Unicode is complicated. A char is not a character; it is a byte, and in UTF-8 a single codepoint can take up to four bytes. Even if you work at the codepoint level, a user-perceived character may still be represented by more than one codepoint. For example, пусты́нных is 9 characters but 10 codepoints, because the stress mark is a separate combining codepoint. Russian may not be the clearest illustration, since it does not use diacritics extensively; other languages do.
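To make the byte/codepoint/character distinction concrete, here is a minimal sketch using only the standard library. The names count_codepoints and sample are made up for illustration, the input is assumed to be valid UTF-8, and the word is spelled out as raw bytes so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode codepoints in a UTF-8 string by
// counting every byte that is NOT a continuation byte (10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t count_codepoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: starts a new codepoint
        }
    }
    return n;
}

// The example word as UTF-8 bytes: nine Cyrillic letters plus
// U+0301 COMBINING ACUTE ACCENT after the first "ы".
const std::string sample =
    "\xD0\xBF"   // п U+043F
    "\xD1\x83"   // у U+0443
    "\xD1\x81"   // с U+0441
    "\xD1\x82"   // т U+0442
    "\xD1\x8B"   // ы U+044B
    "\xCC\x81"   // combining acute accent U+0301
    "\xD0\xBD"   // н U+043D
    "\xD0\xBD"   // н U+043D
    "\xD1\x8B"   // ы U+044B
    "\xD1\x85";  // х U+0445
```

Here sample.size() is 20 (bytes, i.e. chars) and count_codepoints(sample) is 10, while a reader sees 9 characters. Neither number from the standard library matches what the user perceives.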
To actually iterate over user-perceived characters (grapheme clusters, in Unicode terminology), you'll need a specialized Unicode library, namely ICU.
However, what is the real-world use of iterating over the characters?