How to match unicode characters with boost::spirit?

Posted by 谁都会走 on 2019-11-30 13:04:05
sehe

I haven't got much experience with it, but apparently Spirit (SVN trunk version) supports Unicode.

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

See, e.g., the sexpr parser sample in the scheme demo:

BOOST_ROOT/libs/spirit/example/scheme

I believe this is based on the demo from a presentation by Bryce Lelbach [1], which specifically showcases:

  • wchar support
  • utree attributes (still experimental)
  • s-expressions

There is an online article about S-expressions and variant.


[1] In case it is indeed based on that presentation: the video and the slides (pdf, odp) from it are available online.

You can't. The problem is not in boost::spirit; it is that Unicode is complicated. A char is not a character, it is a byte. And even if you work at the code point level, a single user-perceived character may still be represented by more than one code point (e.g. пусты́нных is 9 characters but 10 code points). This may not be obvious in Russian, which doesn't use diacritics extensively, but other languages do.
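To make the byte / code point / character distinction concrete, here is a minimal sketch in plain C++ (no Spirit involved). It assumes the word above is stored as well-formed UTF-8 with the stress mark as a separate combining character (U+0301); the function names `count_code_points` and `check_example` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is NOT a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Checks the figures for the example word, written out as UTF-8 bytes so the
// result does not depend on the source file encoding.
void check_example() {
    const std::string word =
        "\xD0\xBF\xD1\x83\xD1\x81\xD1\x82\xD1\x8B\xCC\x81"  // п у с т ы + U+0301
        "\xD0\xBD\xD0\xBD\xD1\x8B\xD1\x85";                  // н н ы х
    assert(word.size() == 20);              // 20 chars, i.e. bytes
    assert(count_code_points(word) == 10);  // 10 code points
    // A reader still sees 9 characters; counting those needs grapheme
    // cluster segmentation, which this sketch does not attempt.
}
```

So `std::string::size()` reports 20, the code point count is 10, and neither matches the 9 characters a reader sees.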

To actually iterate over user-perceived characters (grapheme clusters, in Unicode terminology), you'll need to use a specialized Unicode library, namely ICU.

However, what is the real-world use of iterating over the characters?

In Boost 1.58 I can match any Unicode symbol with this:

*boost::spirit::qi::unicode::char_

I don't know how to define a specific range of Unicode symbols.
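Independently of Spirit, matching a range of Unicode symbols comes down to decoding the input into code points and comparing each one against the range bounds. The sketch below is plain C++ and is not Spirit's API; `decode_utf8` and `match_range_prefix` are hypothetical helpers, the decoder assumes well-formed UTF-8, and the Cyrillic block U+0400..U+04FF is used only as a sample range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 into code points. Assumes well-formed input: overlong forms,
// surrogates and truncated sequences are not rejected.
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // number of continuation bytes to read
        std::uint32_t cp = lead; // ASCII bytes map to themselves
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        ++i;
        for (std::size_t k = 0; k < extra && i < s.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Number of leading code points of `utf8` that lie in [lo, hi].
std::size_t match_range_prefix(const std::string& utf8,
                               std::uint32_t lo, std::uint32_t hi) {
    assert(lo <= hi);
    std::size_t n = 0;
    for (std::uint32_t cp : decode_utf8(utf8)) {
        if (cp < lo || cp > hi) break;
        ++n;
    }
    return n;
}
```

For instance, for the input "пуabc" the call `match_range_prefix(input, 0x0400, 0x04FF)` stops at the first ASCII letter and returns 2.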
