Internal Boost::Spirit code segfaults when parsing a composite grammar


爱一瞬间的悲伤 2021-01-14 14:49

I'm trying to use Spirit to parse expressions of the form Module1.Module2.value (any number of dot-separated capitalized identifiers, then a dot, then a lowerc

1 answer

旧时难觅i (original poster)

2021-01-14 15:17

You're running into trouble because expression templates keep internal references to temporaries.

Simply aggregate the sub-parser instances:

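template <typename Iterator>
struct value_path : grammar<Iterator, boost::tuple<std::vector<std::string>, std::string>()> {
    value_path() : value_path::base_type(start)
    {
        start = -(module_path_ >> '.') >> value_name_;
    }
  private:

    rule<Iterator, boost::tuple<std::vector<std::string>, std::string>()> start;
    // the sub-grammars are members, so they live exactly as long as this grammar
    module_path<Iterator> module_path_;
    value_name<Iterator> value_name_;
};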
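To make the lifetime point concrete, here is a toy model in plain C++. It is deliberately not Spirit: `seq_ref`, `upper_word`, `dot_lower_word` and `composite` are names I made up for illustration. The sequence node only stores references to its operands, much like an expression template does, so it is safe only because the operands are members of the same object that owns the node.

```cpp
#include <cassert>
#include <cctype>
#include <string>

namespace sketch {
    // Toy sequence node: like an expression template, it holds its operands
    // by reference only and never copies them.
    template <typename L, typename R>
    struct seq_ref {
        L const& left;
        R const& right;
        bool parse(char const*& p, char const* end) const {
            char const* save = p;
            if (left.parse(p, end) && right.parse(p, end)) return true;
            p = save;                       // restore the position on failure
            return false;
        }
    };

    // Hypothetical stand-in for a module_path sub-grammar: one capitalized word.
    struct upper_word {
        bool parse(char const*& p, char const* end) const {
            if (p == end || !std::isupper(static_cast<unsigned char>(*p))) return false;
            while (p != end && std::isalnum(static_cast<unsigned char>(*p))) ++p;
            return true;
        }
    };

    // Hypothetical stand-in for '.' followed by a value_name sub-grammar.
    struct dot_lower_word {
        bool parse(char const*& p, char const* end) const {
            if (p == end || *p != '.') return false;
            char const* q = p + 1;
            if (q == end || !std::islower(static_cast<unsigned char>(*q))) return false;
            while (q != end && std::isalnum(static_cast<unsigned char>(*q))) ++q;
            p = q;
            return true;
        }
    };

    // The composite owns its operands as members (declared before `start`),
    // so the references inside `start` stay valid as long as the composite lives.
    struct composite {
        upper_word     module_;
        dot_lower_word value_;
        seq_ref<upper_word, dot_lower_word> start{module_, value_};

        composite() = default;
        composite(composite const&) = delete;   // a copy would still refer to the source's members

        bool parse_all(std::string const& s) const {
            char const* p   = s.data();
            char const* end = p + s.size();
            return start.parse(p, end) && p == end;
        }
    };
}
```

Had `start` been built from temporaries created inside the constructor, those references would dangle as soon as the constructor returned, which is the kind of bug that shows up as a crash deep inside library code.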

Notes: I feel it might be a design smell to use separate sub-grammars for such small items. Grammar decomposition is frequently a good idea to keep build times manageable and code size somewhat lower, but, judging from the description here, you might be overdoing things.

The "plastering" of parser expressions behind a qi::rule (effectively type erasure) comes with a possibly significant runtime overhead. If you subsequently instantiate those for more than a single iterator type, you may be compounding this with unnecessary growth of the binary.
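As a rough picture of what type erasure means here, the sketch below puts a statically typed parser next to a type-erased one. `std::function` is only my stand-in for illustration; qi::rule has its own machinery, and the names `upper_run`, `erased_rule` and `parse_all` are made up.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <string>

namespace sketch {
    using It = std::string::const_iterator;

    // Statically typed parser: the compiler sees the whole body at the call
    // site and is free to inline it.
    struct upper_run {
        bool operator()(It& f, It l) const {
            It const start = f;
            while (f != l && std::isupper(static_cast<unsigned char>(*f))) ++f;
            return f != start;              // succeed only if something was consumed
        }
    };

    // Type-erased "rule": any callable with this signature fits, but every
    // invocation goes through an indirect call.
    using erased_rule = std::function<bool(It&, It)>;

    // Works with either kind of parser; requires the whole input to match.
    template <typename Parser>
    bool parse_all(Parser const& p, std::string const& s) {
        It f = s.begin();
        return p(f, s.end()) && f == s.end();
    }
}
```

Both variants accept the same inputs; the difference is only in how the call is dispatched, and in how much code gets instantiated per iterator type.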

UPDATE: Regarding the idiomatic way to compose your grammars in Spirit, here's my take:

Live On Coliru

using namespace ascii;
using qi::raw;

lowercase_ident  = raw[ (lower | '_') >> *(alnum | '_' | '\'') ];
module_path_item = raw[ upper >> *(alnum | '_' | '\'') ];
module_path_     = module_path_item % '.';

auto special_char = boost::proto::deep_copy(char_("-+!$%&*./:<=>?@^|~"));

operator_name = qi::raw [
          ('!' >> *special_char)                          /* branch 1     */
        | (char_("~?") >> +special_char)                  /* branch 2     */
        | (!char_(".:") >> special_char >> *special_char) /* branch 3     */
        | "mod"                                           /* branch 4     */
        | "lor" | "lsl" | "lsr" | "asr" | "or"            /* branch 5-9   */
        | "-."                                            /* branch 10    doesn't match because of branch 3   */
        | "!=" | "||" | "&&"                              /* branch 11-13 doesn't match because of branch 1,3 */
        | ":="                                            /* branch 14    still matches: branch 3 rejects a leading ':' */
     // | (special_char - char_("!$%./:?@^|~"))           /* "*+=<>&-" cannot match because of branch 3 */
    ]
    ;

value_name_  = 
      lowercase_ident
    | '(' >> operator_name >> ')'
    ;

start = -(module_path_ >> '.') >> value_name_;

Where the rules are fields declared as:

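qi::rule<Iterator, ast::value_path(), Skipper> start;
qi::rule<Iterator, ast::module_path(), Skipper> module_path_;

// lexeme: (no skipper)
qi::rule<Iterator, std::string()> value_name_, module_path_item, lowercase_ident, operator_name;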

Notes:

I've added a skipper: your value_path grammar didn't declare one, so any skipper you passed into qi::phrase_parse was being ignored
The lexemes just drop the skipper from the rule declaration type, so you don't even need to specify qi::lexeme[]
In the lexemes, I kept your intention of copying the parsed text verbatim by using qi::raw. This lets us write the grammars more succinctly (using '!' instead of char_('!'), "mod" instead of qi::string("mod")). Note that bare literals are implicitly transformed into "non-capturing" qi::lit(...) nodes in the context of a Qi parser expression, but since we use raw[] anyway, the fact that lit doesn't expose an attribute is not a problem.
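If it helps to see what the identifier part of this grammar accepts, here is a hand-written recognizer in plain C++ that mirrors it. This is only a sketch under simplifying assumptions: no skipper, no operator names, and the namespace `sketch` and helper names are mine, not part of the Spirit code. As with raw[], the "attribute" of each identifier is just the matched slice of the input.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

namespace sketch {
    // Same shape as the AST used in the full listing.
    struct value_path {
        std::vector<std::string> module;
        std::string              value_expr;
    };

    using It = std::string::const_iterator;

    inline bool is_upper(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }
    inline bool is_lower(char c) { return std::islower(static_cast<unsigned char>(c)) != 0; }
    inline bool is_tail(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_' || c == '\'';
    }

    // Models raw[ first >> *(alnum | '_' | '\'') ]: on success, `out` receives
    // the matched source text verbatim.
    template <typename First>
    bool ident(It& f, It l, First first, std::string& out) {
        if (f == l || !first(*f)) return false;
        It const begin = f++;
        while (f != l && is_tail(*f)) ++f;
        out.assign(begin, f);
        return true;
    }

    // Models start = -(module_path_ >> '.') >> value_name_ for identifiers only.
    inline bool parse(std::string const& input, value_path& out) {
        It f = input.begin();
        It const l = input.end();
        value_path result;
        for (;;) {
            It const save = f;
            std::string item;
            if (ident(f, l, is_upper, item) && f != l && *f == '.') {
                ++f;                            // consume the separating dot
                result.module.push_back(item);
            } else {
                f = save;                       // backtrack: not part of the module path
                break;
            }
        }
        auto lower_or_underscore = [](char c) { return is_lower(c) || c == '_'; };
        if (!ident(f, l, lower_or_underscore, result.value_expr) || f != l) return false;
        out = result;
        return true;
    }
}
```

Calling `sketch::parse("Some.Module.Package.ident", v)` fills `v.module` with the three module names and `v.value_expr` with `ident`, while an input such as `A.B` is rejected because nothing lowercase follows the last dot.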

I think this results in a perfectly cromulent grammar definition that should satisfy your criteria for "high-level". There's some wtf-y-ness with the grammar itself (likely regardless of how it is expressed in any parser generator language):

I've simplified the operator_name rule by removing nesting of alternative branches that will result in the same effect as the simplified flat alternative list
I've refactored the "magic" lists of special characters into special_char
In alternative branch 3, e.g., I've noted the exceptions with a negative assertion:
```
(!char_(".:") >> special_char >> *special_char) /* branch 3     */
```
The !char_(".:") assertion says: when the input wouldn't match '.' or ':' continue matching (any sequence of special characters). In fact you could equivalently write this as:
```
((special_char - '.' - ':') >> *special_char) /* branch 3     */
```
or even, as I ended up writing it:
```
(!char_(".:") >> +special_char) /* branch 3     */
```
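To convince yourself that the three spellings agree, you can model them by hand. The sketch below is plain C++, not Spirit; the function names `branch3_a`, `branch3_b` and `branch3_c` are made up, and each returns the number of characters matched (0 meaning the branch failed).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {
    inline bool is_special(char c) {
        return std::string("-+!$%&*./:<=>?@^|~").find(c) != std::string::npos;
    }
    inline bool is_dot_or_colon(char c) { return c == '.' || c == ':'; }

    // Greedy *special_char starting at pos; returns the position after the run.
    inline std::size_t star(std::string const& s, std::size_t pos) {
        while (pos < s.size() && is_special(s[pos])) ++pos;
        return pos;
    }

    // !char_(".:") >> special_char >> *special_char
    inline std::size_t branch3_a(std::string const& s) {
        if (!s.empty() && is_dot_or_colon(s[0])) return 0;   // negative lookahead
        if (s.empty() || !is_special(s[0])) return 0;         // one special_char
        return star(s, 1);                                    // then zero or more
    }

    // (special_char - '.' - ':') >> *special_char
    inline std::size_t branch3_b(std::string const& s) {
        if (s.empty() || !is_special(s[0]) || is_dot_or_colon(s[0])) return 0;
        return star(s, 1);
    }

    // !char_(".:") >> +special_char
    inline std::size_t branch3_c(std::string const& s) {
        if (!s.empty() && is_dot_or_colon(s[0])) return 0;   // negative lookahead
        return star(s, 0);                                    // 0 here means + failed
    }
}
```

For every input, all three functions consume the same prefix: `-.` and `<=` match fully, while anything starting with `.` or `:` is rejected up front.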

The simplification of the branches actually raises the level of abstraction! It becomes clear now, that some of the branches will never match, because earlier branches match the input by definition:

   | "-."                                    /* branch 10    doesn't match because of branch 3   */
   | "!=" | "||" | "&&"                      /* branch 11-13 doesn't match because of branch 1,3 */
   | ":="                                    /* branch 14    still matches: branch 3 rejects a leading ':' */
// | (special_char - char_("!$%./:?@^|~"))   /* "*+=<>&-" cannot match because of branch 3 */
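The effect of ordered choice can be checked with a small model. The following is a plain C++ sketch of the operator_name alternatives (not Spirit; `first_matching_branch` and its helpers are names I invented) that reports which branch wins for a given input. One thing it makes visible: `:=` does still reach its literal branch, because branch 3 refuses a leading `:`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

namespace sketch {
    inline bool is_special(char c) {
        return std::string("-+!$%&*./:<=>?@^|~").find(c) != std::string::npos;
    }

    // Greedy run of special characters from pos; returns the position after it.
    inline std::size_t run(std::string const& s, std::size_t pos) {
        while (pos < s.size() && is_special(s[pos])) ++pos;
        return pos;
    }

    inline bool starts_with(std::string const& s, std::string const& lit) {
        return s.compare(0, lit.size(), lit) == 0;
    }

    // Returns {branch number, matched length}, or {0, 0} if nothing matches.
    // Like an ordered alternative, the first branch that matches wins and
    // later branches are never consulted.
    inline std::pair<int, std::size_t> first_matching_branch(std::string const& s) {
        if (s.empty()) return {0, 0};
        // branch 1: '!' >> *special_char
        if (s[0] == '!') return {1, run(s, 1)};
        // branch 2: char_("~?") >> +special_char
        if ((s[0] == '~' || s[0] == '?') && run(s, 1) > 1) return {2, run(s, 1)};
        // branch 3: !char_(".:") >> +special_char
        if (s[0] != '.' && s[0] != ':' && run(s, 0) > 0) return {3, run(s, 0)};
        // branches 4-14: the literals, in the order they appear in the grammar
        static std::string const literals[] = {
            "mod", "lor", "lsl", "lsr", "asr", "or", "-.", "!=", "||", "&&", ":="
        };
        for (int i = 0; i < 11; ++i)
            if (starts_with(s, literals[i])) return {4 + i, literals[i].size()};
        return {0, 0};
    }
}
```

With this model, `-.`, `||` and `&&` are claimed by branch 3 and `!=` by branch 1, so their literal branches are dead code.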

I hope you can see why I qualify this part of the grammar as "a little bit wtf-y" :) I'll assume for now that you got confused or something went wrong when you reduced it to a single rule (your "fool's errand").

Some further improvements to be noted:

I've added a proper AST struct instead of the boost::tuple<> to make the code more legible
I've added BOOST_SPIRIT_DEBUG* macros so you can debug your grammar at a high level (the rule level)
I've ditched the blanket using namespace. This is generally a bad idea. And with Spirit it is frequently a very bad idea (it can lead to ambiguities that are unsolvable, or to very hard to spot errors). As you can see, it doesn't necessarily lead to very verbose code.

Full Listing

#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace qi    = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;

namespace ast {
    using module_path = std::vector<std::string>;
    struct value_path {
        module_path module;
        std::string   value_expr;
    };
}

BOOST_FUSION_ADAPT_STRUCT(ast::value_path, (ast::module_path, module)(std::string,value_expr))

template <typename Iterator, typename Skipper>
struct value_path : qi::grammar<Iterator, ast::value_path(), Skipper> {
    value_path() : value_path::base_type(start)
    {
        using namespace ascii;
        using qi::raw;

        lowercase_ident  = raw[ (lower | '_') >> *(alnum | '_' | '\'') ];
        module_path_item = raw[ upper >> *(alnum | '_' | '\'') ];
        module_path_     = module_path_item % '.';

        // deep_copy stores the char_ set by value, so `auto` holds no dangling reference
        auto special_char = boost::proto::deep_copy(char_("-+!$%&*./:<=>?@^|~"));

        operator_name = qi::raw [
                  ('!'          >> *special_char)         /* branch 1     */
                | (char_("~?")  >> +special_char)         /* branch 2     */
                | (!char_(".:") >> +special_char)         /* branch 3     */
                | "mod"                                   /* branch 4     */
                | "lor" | "lsl" | "lsr" | "asr" | "or"    /* branch 5-9   */
                | "-."                                    /* branch 10    doesn't match because of branch 3   */
                | "!=" | "||" | "&&"                      /* branch 11-13 doesn't match because of branch 1,3 */
                | ":="                                    /* branch 14    still matches: branch 3 rejects a leading ':' */
             // | (special_char - char_("!$%./:?@^|~"))   /* "*+=<>&-" cannot match because of branch 3 */
            ]
            ;

        value_name_  = 
              lowercase_ident
            | '(' >> operator_name >> ')'
            ;

        start = -(module_path_ >> '.') >> value_name_;

        BOOST_SPIRIT_DEBUG_NODES((start)(module_path_)(value_name_)(module_path_item)(lowercase_ident)(operator_name))
    }
  private:
    qi::rule<Iterator, ast::value_path(), Skipper> start;
    qi::rule<Iterator, ast::module_path(), Skipper> module_path_;

    // lexeme: (no skipper)
    qi::rule<Iterator, std::string()> value_name_, module_path_item, lowercase_ident, operator_name;
};

int main()
{
    for (std::string const input : { 
            "Some.Module.Package.ident",
            "ident",
            "A.B.C_.mod",    // as lowercase_ident
            "A.B.C_.(mod)",  // as operator_name (branch 4)
            "A.B.C_.(!=)",   // as operator_name (branch 1)
            "(!)"            // as operator_name (branch 1)
            })
    {
        std::cout << "--------------------------------------------------------------\n";
        std::cout << "Parsing '" << input << "'\n";

        using It = std::string::const_iterator;
        It f(input.begin()), l(input.end());

        value_path<It, ascii::space_type> g;
        ast::value_path data;
        bool ok = qi::phrase_parse(f, l, g, ascii::space, data);
        if (ok) {
            std::cout << "Parse succeeded\n";
        } else {
            std::cout << "Parse failed\n";
        }

        // report any input the grammar did not consume
        if (f!=l)
            std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
    }
}

Debug Output

--------------------------------------------------------------
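Parsing 'Some.Module.Package.ident'
<start>
  <try>Some.Module.Package.</try>
  <module_path_>
    <try>Some.Module.Package.</try>
    <module_path_item>
      <try>Some.Module.Package.</try>
      <success>.Module.Package.iden</success>
      <attributes>[[S, o, m, e]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>Module.Package.ident</try>
      <success>.Package.ident</success>
      <attributes>[[M, o, d, u, l, e]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>Package.ident</try>
      <success>.ident</success>
      <attributes>[[P, a, c, k, a, g, e]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>ident</try>
      <fail/>
    </module_path_item>
    <success>.ident</success>
    <attributes>[[[S, o, m, e], [M, o, d, u, l, e], [P, a, c, k, a, g, e]]]</attributes>
  </module_path_>
  <value_name_>
    <try>ident</try>
    <lowercase_ident>
      <try>ident</try>
      <success></success>
      <attributes>[[i, d, e, n, t]]</attributes>
    </lowercase_ident>
    <success></success>
    <attributes>[[i, d, e, n, t]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[[S, o, m, e], [M, o, d, u, l, e], [P, a, c, k, a, g, e]], [i, d, e, n, t]]]</attributes>
</start>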
Parse succeeded
--------------------------------------------------------------
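Parsing 'ident'
<start>
  <try>ident</try>
  <module_path_>
    <try>ident</try>
    <module_path_item>
      <try>ident</try>
      <fail/>
    </module_path_item>
    <fail/>
  </module_path_>
  <value_name_>
    <try>ident</try>
    <lowercase_ident>
      <try>ident</try>
      <success></success>
      <attributes>[[i, d, e, n, t]]</attributes>
    </lowercase_ident>
    <success></success>
    <attributes>[[i, d, e, n, t]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[], [i, d, e, n, t]]]</attributes>
</start>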
Parse succeeded
--------------------------------------------------------------
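Parsing 'A.B.C_.mod'
<start>
  <try>A.B.C_.mod</try>
  <module_path_>
    <try>A.B.C_.mod</try>
    <module_path_item>
      <try>A.B.C_.mod</try>
      <success>.B.C_.mod</success>
      <attributes>[[A]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>B.C_.mod</try>
      <success>.C_.mod</success>
      <attributes>[[B]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>C_.mod</try>
      <success>.mod</success>
      <attributes>[[C, _]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>mod</try>
      <fail/>
    </module_path_item>
    <success>.mod</success>
    <attributes>[[[A], [B], [C, _]]]</attributes>
  </module_path_>
  <value_name_>
    <try>mod</try>
    <lowercase_ident>
      <try>mod</try>
      <success></success>
      <attributes>[[m, o, d]]</attributes>
    </lowercase_ident>
    <success></success>
    <attributes>[[m, o, d]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[[A], [B], [C, _]], [m, o, d]]]</attributes>
</start>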
Parse succeeded
--------------------------------------------------------------
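Parsing 'A.B.C_.(mod)'
<start>
  <try>A.B.C_.(mod)</try>
  <module_path_>
    <try>A.B.C_.(mod)</try>
    <module_path_item>
      <try>A.B.C_.(mod)</try>
      <success>.B.C_.(mod)</success>
      <attributes>[[A]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>B.C_.(mod)</try>
      <success>.C_.(mod)</success>
      <attributes>[[B]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>C_.(mod)</try>
      <success>.(mod)</success>
      <attributes>[[C, _]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>(mod)</try>
      <fail/>
    </module_path_item>
    <success>.(mod)</success>
    <attributes>[[[A], [B], [C, _]]]</attributes>
  </module_path_>
  <value_name_>
    <try>(mod)</try>
    <lowercase_ident>
      <try>(mod)</try>
      <fail/>
    </lowercase_ident>
    <operator_name>
      <try>mod)</try>
      <success>)</success>
      <attributes>[[m, o, d]]</attributes>
    </operator_name>
    <success></success>
    <attributes>[[m, o, d]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[[A], [B], [C, _]], [m, o, d]]]</attributes>
</start>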
Parse succeeded
--------------------------------------------------------------
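Parsing 'A.B.C_.(!=)'
<start>
  <try>A.B.C_.(!=)</try>
  <module_path_>
    <try>A.B.C_.(!=)</try>
    <module_path_item>
      <try>A.B.C_.(!=)</try>
      <success>.B.C_.(!=)</success>
      <attributes>[[A]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>B.C_.(!=)</try>
      <success>.C_.(!=)</success>
      <attributes>[[B]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>C_.(!=)</try>
      <success>.(!=)</success>
      <attributes>[[C, _]]</attributes>
    </module_path_item>
    <module_path_item>
      <try>(!=)</try>
      <fail/>
    </module_path_item>
    <success>.(!=)</success>
    <attributes>[[[A], [B], [C, _]]]</attributes>
  </module_path_>
  <value_name_>
    <try>(!=)</try>
    <lowercase_ident>
      <try>(!=)</try>
      <fail/>
    </lowercase_ident>
    <operator_name>
      <try>!=)</try>
      <success>)</success>
      <attributes>[[!, =]]</attributes>
    </operator_name>
    <success></success>
    <attributes>[[!, =]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[[A], [B], [C, _]], [!, =]]]</attributes>
</start>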
Parse succeeded
--------------------------------------------------------------
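Parsing '(!)'
<start>
  <try>(!)</try>
  <module_path_>
    <try>(!)</try>
    <module_path_item>
      <try>(!)</try>
      <fail/>
    </module_path_item>
    <fail/>
  </module_path_>
  <value_name_>
    <try>(!)</try>
    <lowercase_ident>
      <try>(!)</try>
      <fail/>
    </lowercase_ident>
    <operator_name>
      <try>!)</try>
      <success>)</success>
      <attributes>[[!]]</attributes>
    </operator_name>
    <success></success>
    <attributes>[[!]]</attributes>
  </value_name_>
  <success></success>
  <attributes>[[[], [!]]]</attributes>
</start>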
Parse succeeded
