How to use boost::spirit to parse UTF-8?

社会主义新天地 submitted on 2019-11-29 11:11:24

If your input is UTF-8, you can use the Unicode iterators from Boost.Regex.

For instance, use boost::u8_to_u32_iterator, which the Boost.Regex documentation describes as:

A Bidirectional iterator adapter that makes an underlying sequence of UTF8 characters look like a (read-only) sequence of UTF32 characters.
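
To make the adapter's job concrete, here is a minimal sketch of the same UTF-8 to UTF-32 conversion done eagerly by hand. decode_utf8 is a hypothetical helper written for illustration; it is not Boost's implementation or API, and for brevity it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): decode a UTF-8 byte string into
// UTF-32 code points. The Boost adapter performs this conversion lazily,
// one code point per iterator step; this sketch does it all at once.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;   // payload bits collected so far
        std::size_t len;    // total bytes in this sequence

        // The lead byte's high bits give the sequence length.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, the three bytes E4 BD A0 (the UTF-8 encoding of 你) decode to the single code point 20320.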

Complete example:

#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <cstdint>
#include <vector>

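int main()
{
    using namespace boost;
    using namespace spirit::qi;
    using namespace std;

    // Before C++20 a u8 literal is an array of char; in C++20 it is an
    // array of char8_t and would not match the const char* iterator below.
    auto &&utf8_text=u8"你好,世界!";

    // Adapt the UTF-8 bytes into a sequence of UTF-32 code points.
    // end() of the array includes the literal's terminating '\0',
    // which is why the output ends with &#0;.
    u8_to_u32_iterator<const char*>
        tbegin(begin(utf8_text)), tend(end(utf8_text));

    // Collect every code point matched by the wide char_ parser.
    vector<uint32_t> result;
    parse(tbegin, tend, *standard_wide::char_, result);

    // Print each code point as a decimal numeric character reference.
    for(auto &&code_point : result)
        cout << "&#" << code_point << ";";
    cout << endl;
}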

Output is:

&#20320;&#22909;&#65292;&#19990;&#30028;&#65281;&#0;
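
The final &#0; appears because the string literal is an array that contains its terminating '\0', so the begin/end range over the array is one element longer than the text. Below is a minimal sketch of one way to exclude it; text_end is a hypothetical helper introduced here, and it assumes the literal is an array of char (true for u8 literals before C++20).

```cpp
#include <cstddef>

// Hypothetical helper: returns a pointer to the literal's terminating '\0',
// which is one past the last text byte. Using it as the end of the range
// leaves the terminator out.
template <std::size_t N>
const char* text_end(const char (&literal)[N])
{
    static_assert(N >= 1, "a string literal always holds at least '\\0'");
    return literal + N - 1;
}
```

In the demo, constructing tend from text_end(utf8_text) instead of end(utf8_text) should drop the final &#0; from the output.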