Creating a boost::spirit::x3 parser for quoted strings with escape sequence handling

Submitted by 夏天 on 2021-02-10 05:27:32

Question


I need to create a parser for quoted strings for my custom language that will also properly handle escape sequences, which includes allowing escaped quotes within the string. This is my current string parser:

x3::lexeme[quote > *(x3::char_ - quote) > quote]

where quote is just a constant expression for '"'. It does no escape sequence handling whatsoever. I know about boost::spirit::classic::lex_escape_ch_p, but I have no idea how to use it with the boost::spirit::x3 tools (or in general). How could I create a parser that does this? The parser would have to recognize most escape sequences: common ones like '\n' and '\t', and more complex ones like hex, octal, and ANSI escape sequences.

My apologies if there's something wrong with this post, it's my first time posting on SO.

EDIT:

Here is how I ended up implementing the parser:

x3::lexeme[quote > *(
    ("\\\"" >> &x3::char_) >> x3::attr(quote) | ~x3::char_(quote)
    ) > quote]
[handle_escape_sequences];

where handle_escape_sequences is a lambda:

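auto handle_escape_sequences = [&](auto&& context) -> void {
    std::string& str = x3::_val(context);

    // Write index for the decoded characters; it never overtakes the read position
    uint32_t i{};

    // Not static: a static lambda would keep references to the first call's str and i
    auto replace = [&](const char replacement) -> void {
        str[i++] = replacement;
    };

    if (!classic::parse(std::begin(str), std::end(str), *classic::lex_escape_ch_p[replace]).full)
        throw Error{ "invalid literal" }; // invalid escape sequence most likely

    str.resize(i); // drop the leftover tail behind the decoded text
};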

Because it decodes C-style escapes, including hex and octal ones such as \x1b and \033, the strings can carry full ANSI escape sequences, which means you can use it to do all sorts of fancy terminal manipulation like setting the text color, cursor position, etc.
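To illustrate why writing back into str while it is still being read is safe, here is a minimal sketch without Spirit. It handles only four escapes, and the name decode_in_place is hypothetical, not part of the code above. Every escape sequence is at least as long as the character it decodes to, so the write index can never overtake the read index.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the lambda above: decodes \n, \t, \\ and \" in place.
inline void decode_in_place(std::string& str) {
    std::size_t write = 0; // next slot to fill; always <= read
    for (std::size_t read = 0; read < str.size(); ++read) {
        char c = str[read];
        if (c == '\\') {
            // A backslash must be followed by one of the known escape characters
            if (++read == str.size()) throw std::runtime_error("invalid literal");
            switch (str[read]) {
                case 'n':  c = '\n'; break;
                case 't':  c = '\t'; break;
                case '\\': c = '\\'; break;
                case '"':  c = '"';  break;
                default:   throw std::runtime_error("invalid literal");
            }
        }
        str[write++] = c;
    }
    str.resize(write); // cut off the unused tail
}
```

Calling it on a string holding the six characters a\tb\\ leaves the four characters a, tab, b, backslash.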

Here's the full definition of the rule as well as all of the stuff it depends on (I just picked everything related to it out of my code so that's why the result looks like proper spaghetti) in case someone happens to need it:

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/include/classic_utility.hpp>
#include <cstdint>
#include <string>

using namespace boost::spirit;

#define RULE_DECLARATION(rule_name, attribute_type)                            \
inline namespace Tag { class rule_name ## _tag; }                              \
x3::rule<Tag::rule_name ## _tag, attribute_type, true> rule_name = #rule_name; \

#define SIMPLE_RULE_DEFINITION(rule_name, attribute_type, definition) \
RULE_DECLARATION(rule_name, attribute_type)                           \
auto rule_name ## _def = definition;                                  \
BOOST_SPIRIT_DEFINE(rule_name);

constexpr char quote = '"';


template <class Base, class>
struct Access_base_s : Base {
    using Base::Base, Base::operator=;
};

template <class Base, class Tag>
using Unique_alias_for = Access_base_s<Base, Tag>;


using String_literal = Unique_alias_for<std::string, class String_literal_tag>;

SIMPLE_RULE_DEFINITION(string_literal, String_literal,
    x3::lexeme[quote > *(
        ("\\\"" >> &x3::char_) >> x3::attr(quote) | ~x3::char_(quote)
        ) > quote]
    [handle_escape_sequences]
);

Answer 1:


I have many examples of this on this site.¹

Let me start by simplifying your expression (~charset is likely more efficient than charset - exceptions):

x3::lexeme['"' > *~x3::char_('"') > '"']

Now, to allow escapes, we can decode them ad hoc:

auto qstring = x3::lexeme['"' > *(
         "\\n" >> x3::attr('\n')
       | "\\b" >> x3::attr('\b')
       | "\\f" >> x3::attr('\f')
       | "\\t" >> x3::attr('\t')
       | "\\v" >> x3::attr('\v')
       | "\\0" >> x3::attr('\0')
       | "\\r" >> x3::attr('\r')
       | "\\"  >> x3::char_("\"\\")
       | ~x3::char_('"')
   ) > '"'];

Alternatively you could use a symbols approach, either including or excluding the slash:

x3::symbols<char> escapes;
escapes.add
    ( "\\n", '\n')
    ( "\\b", '\b')
    ( "\\f", '\f')
    ( "\\t", '\t')
    ( "\\v", '\v')
    ( "\\0", '\0')
    ( "\\r", '\r')
    ( "\\\\", '\\')
    ( "\\\"", '"');

auto qstring = x3::lexeme['"' > *(escapes | ~x3::char_('"')) > '"'];


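I think I prefer the hand-rolled branches, because they give you the flexibility to add e.g. hex/octal escapes (mind the conflict with \0 though: a dedicated "\\0" branch placed before the octal branch would decode \012 as a NUL followed by the characters 12):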

       | "\\" >> x3::int_parser<char, 8, 1, 3>()
       | "\\x" >> x3::int_parser<char, 16, 2, 2>()

Which also works fine:


#include <boost/spirit/home/x3.hpp>
#include <iostream>
#include <iomanip>
#include <string>

int main() {
    namespace x3 = boost::spirit::x3;

    auto qstring = x3::lexeme['"' > *(
             "\\n" >> x3::attr('\n')
           | "\\b" >> x3::attr('\b')
           | "\\f" >> x3::attr('\f')
           | "\\t" >> x3::attr('\t')
           | "\\v" >> x3::attr('\v')
           | "\\r" >> x3::attr('\r')
           | "\\"  >> x3::char_("\"\\")
           | "\\" >> x3::int_parser<char, 8, 1, 3>()
           | "\\x" >> x3::int_parser<char, 16, 2, 2>()
           | ~x3::char_('"')
       ) > '"'];

    for (std::string const input : { R"("\ttest\x41\x42\x43 \x031\x032\x033 \"hello\"\r\n")" }) {
        std::string output;
        auto f = begin(input), l = end(input);
        if (x3::phrase_parse(f, l, qstring, x3::blank, output)) {
            std::cout << "[" << output << "]\n";
        } else {
            std::cout << "Failed\n";
        }
        if (f != l) {
            std::cout << "Remaining unparsed: " << std::quoted(std::string(f,l)) << "\n";
        }
    }
}

Prints

[   testABC 123 "hello"
]
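
Note that the hex branch consumes exactly two digits, so \x031 decodes to the byte 0x03 followed by the character 1. The three 0x03 bytes are in the output but do not show on a terminal. One way to check them is to dump the decoded string as hex; the helper name to_hex below is hypothetical and uses only the standard library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte as two hex digits, separated by spaces.
inline std::string to_hex(std::string const& bytes) {
    std::string out;
    char buf[4];
    for (unsigned char c : bytes) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Printing to_hex(output) instead of output in the loop above makes the control characters, the tab, and the trailing \r\n visible.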

¹ Have a look at these

  • Qi, simple: Replace lit with different string in boost spirit
  • Qi, complete JSON-style: Handling utf-8 in Boost.Spirit with utf-32 parser
  • Qi, practical X500/LDAP distinguished names style: How to parse a grammar into a `std::set` using `boost::spirit`?
  • Qi, practical C-style escapes: boost spirit parsing quote string fails


Source: https://stackoverflow.com/questions/61695235/creating-a-boostspiritx3-parser-for-quoted-strings-with-escape-sequence-hand
