How to match Unicode characters with boost::spirit?

轻奢々 2021-01-02 05:48

How can I match UTF-8 encoded Unicode characters using boost::spirit?

For example, I want to recognize all characters in this string:

$ echo \"На         


        
3 Answers
  •  轮回少年
    2021-01-02 05:51

    You can't. The problem is not boost::spirit but that Unicode is complicated. A char is not a character; it is a byte, and in UTF-8 a single code point can take up to four bytes. Even if you work at the code point level, one user-perceived character may still be represented by more than one code point. For example, пусты́нных is 9 characters but 10 code points, because the stress mark is a separate combining code point. Russian may not make this clear enough because it rarely uses diacritics, but other languages use them extensively.
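To make the byte versus code point distinction concrete, here is a minimal sketch in plain C++17. The helper names `count_bytes` and `count_code_points` and the constant `kWord` are made up for this illustration, and the code assumes the input is well-formed UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string_view>

// Number of bytes (chars) in the UTF-8 encoded text.
constexpr std::size_t count_bytes(std::string_view utf8) {
    return utf8.size();
}

// Number of code points, assuming well-formed UTF-8: every code point
// starts with exactly one byte that is not a continuation byte (10xxxxxx).
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// The word from the answer, spelled out as UTF-8 bytes so the example
// does not depend on the source file encoding. The sixth code point is
// U+0301 (combining acute accent), which attaches to the preceding letter.
constexpr std::string_view kWord =
    "\xD0\xBF\xD1\x83\xD1\x81\xD1\x82\xD1\x8B"  // five Cyrillic letters
    "\xCC\x81"                                  // U+0301 combining acute
    "\xD0\xBD\xD0\xBD\xD1\x8B\xD1\x85";         // four Cyrillic letters
```

With these definitions, `count_bytes(kWord)` is 20 and `count_code_points(kWord)` is 10, while a reader sees 9 characters. Neither number is what a parser working on `char` or on code points would call "one character".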

    To actually iterate over user-perceived characters (grapheme clusters, in Unicode terminology), you'll need a specialized Unicode library, namely ICU.
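To show what grouping code points into user-perceived characters involves, here is a deliberately simplified sketch in plain C++17. It is not ICU and not the full Unicode segmentation rules: it only folds the combining diacritical marks U+0300 to U+036F into the preceding code point, and it assumes well-formed UTF-8. All names (`next_code_point`, `is_combining_diacritic`, `approx_grapheme_count`, `kStressedWord`) are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <string_view>

// Decodes the code point starting at utf8[i] and advances i past it.
// Assumes well-formed UTF-8; no validation is performed.
constexpr char32_t next_code_point(std::string_view utf8, std::size_t& i) {
    const auto lead = static_cast<unsigned char>(utf8[i++]);
    if (lead < 0x80) {
        return lead;  // single-byte (ASCII) code point
    }
    // Number of continuation bytes implied by the lead byte.
    int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);  // payload bits of the lead byte
    while (extra-- > 0 && i < utf8.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(utf8[i++]) & 0x3F);
    }
    return cp;
}

// Approximation: only the "Combining Diacritical Marks" block.
constexpr bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Counts code points that are not combining diacritics, i.e. treats each
// mark as part of the preceding character. Real grapheme segmentation
// has many more rules than this.
constexpr std::size_t approx_grapheme_count(std::string_view utf8) {
    std::size_t n = 0;
    std::size_t i = 0;
    while (i < utf8.size()) {
        if (!is_combining_diacritic(next_code_point(utf8, i))) {
            ++n;
        }
    }
    return n;
}

// The 10 code point word from the answer, as UTF-8 bytes.
constexpr std::string_view kStressedWord =
    "\xD0\xBF\xD1\x83\xD1\x81\xD1\x82\xD1\x8B\xCC\x81"
    "\xD0\xBD\xD0\xBD\xD1\x8B\xD1\x85";
```

For `kStressedWord` this gives 9, matching what a reader sees. It breaks down quickly on other scripts, emoji sequences, and marks outside that one block, which is why a real library is the practical choice; in ICU the relevant entry point is `icu::BreakIterator::createCharacterInstance`.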

    However, what is the real-world use of iterating over the characters?
