Parsing a tokenized free form grammar with Boost.Spirit

Submitted by 柔情痞子 on 2019-12-24 16:46:30

Question


I've got stuck trying to create a Boost.Spirit parser for the output of the callgrind tool, which is part of Valgrind. Callgrind outputs a domain-specific embedded language (DSEL) which lets you do all sorts of cool stuff, like custom expressions for synthetic counters, but it's not easy to parse.

I've placed some sample callgrind output at https://gist.github.com/ned14/5452719#file-sample-callgrind-output. I've placed my current best attempt at a Boost.Spirit lexer and parser at https://gist.github.com/ned14/5452719#file-callgrindparser-hpp and https://gist.github.com/ned14/5452719#file-callgrindparser-cxx. The lexer part is straightforward: it tokenises tag-values, non-whitespace text, comments, ends of lines, integers, hexadecimals, floats and operators (ignore the punctuators in the sample code; they're unused). Whitespace is skipped.

So far so good. The problem is parsing the tokenised input stream. I haven't even attempted the main stanzas yet; I'm still trying to parse the tag-values, which can occur at any point in the file. Tag-values look like this:

tagtext: unknown series of tokens<eol>

It could be freeform text, e.g.

desc: I1 cache: 32768 B, 64 B, 8-way associative, 157 picosec hit latency

In this situation you'd want to convert the set of tokens to a string, i.e. to an iterator_range, and extract it.
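To be clear about the end result I'm after for an unknown tag: just the tag name plus the raw remainder of the line as one string, to be post-processed later. Outside Spirit, that target shape is trivial to state; here is a minimal plain-C++ sketch (the `split_tag_line` helper is hypothetical, and it assumes the line contains a colon):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical illustration of the desired result for a tag-value line:
// everything before the first ':' is the tag, and the rest of the line is
// kept as one raw string for later post-processing.
std::pair<std::string, std::string> split_tag_line(const std::string& line)
{
    auto colon = line.find(':');
    std::string tag = line.substr(0, colon);
    std::string value = line.substr(colon + 1);
    // Trim leading whitespace from the value part.
    auto first = value.find_first_not_of(" \t");
    value = (first == std::string::npos) ? "" : value.substr(first);
    return {tag, value};
}
```

Note that only the first colon delimits the tag; later colons (as in the desc line above) belong to the value.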

It could however be an expression e.g.

event: EPpsec = 316 Ir + 1120 I1mr + 1120 D1mr + 1120 D1mw + 1362 ILmr + 1362 DLmr + 1362 DLmw

This says that from now on, the event EPpsec is to be synthesised as 316 × Ir + 1120 × I1mr + ... and so on.
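To make those semantics concrete: once parsed, such a formula is just a list of (coefficient, counter) terms summed as a dot product against the native counter values. A hypothetical sketch (the `Term` struct and `evaluate` function are my own names, not anything callgrind defines):

```cpp
#include <map>
#include <string>
#include <vector>

// One term of a synthetic-counter formula, e.g. "316 Ir".
struct Term { long coeff; std::string counter; };

// Evaluate a synthetic counter: sum of coefficient * native-counter value.
long evaluate(const std::vector<Term>& formula,
              const std::map<std::string, long>& counters)
{
    long total = 0;
    for (const auto& t : formula)
        total += t.coeff * counters.at(t.counter);
    return total;
}
```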

The point I want to make here is that tag-value pairs need to be accumulated as arbitrary sets of tokens, and post-processed into whatever we turn them into later.

To that end, Boost.Spirit's utree class looked like exactly what I wanted, and that's what the sample code uses. But on VS2012 with the November CTP compiler (for variadic templates) I'm currently seeing this compile error:

1>C:\Users\ndouglas.RIMNET\documents\visual studio 2012\Projects\CallgrindParser\boost\boost/range/iterator_range_core.hpp(56): error C2440: 'static_cast' : cannot convert from 'boost::spirit::detail::list::node_iterator<const boost::spirit::utree>' to 'base_iterator_type'
1>          No constructor could take the source type, or constructor overload resolution was ambiguous
1>          C:\Users\ndouglas.RIMNET\documents\visual studio 2012\Projects\CallgrindParser\boost\boost/range/iterator_range_core.hpp(186) : see reference to function template instantiation 'IteratorT boost::iterator_range_detail::iterator_range_impl<IteratorT>::adl_begin<const Range>(ForwardRange &)' being compiled
1>          with
1>          [
1>              IteratorT=base_iterator_type
1>  ,            Range=boost::spirit::utree
1>  ,            ForwardRange=boost::spirit::utree
1>          ]

... which suggests that my base_iterator_type, a Boost.Spirit multi_pass<> wrapping of an istreambuf_iterator to give it forward-iterator nature, is somehow not understood by Boost.Spirit's utree implementation. The thing is, I'm not sure whether this is my bad code or bad Boost.Spirit code, seeing as line_pos_iterator<> was failing to correctly specify its forward-iterator concept tag.

Thanks to past Stack Overflow help I could write a pure, non-tokenised grammar, but it would be brittle. The right solution is to tokenise and use a freeform grammar capable of handling fairly arbitrary input. Sadly, there are very few real-world (rather than toy) examples of getting Boost.Spirit's Lex and grammar stages working together to achieve this. Any help would therefore be greatly appreciated.

Niall


Answer 1:


The token attribute exposes a variant which, in addition to the base-iterator range, can assume the types declared in the token_type typedef:

    typedef lex::lexertl::token<base_iterator_type, mpl::vector<std::string, int, double>> token_type;

So: string, int and double. Note however that coercion into one of the possible types will only occur lazily, when the parser actually uses the value.
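A rough plain-C++ analogue of that lazy behaviour (an illustration only, not Spirit's real machinery): the token value starts out holding just the matched character range, and is converted to one of the declared attribute types the first time the parser asks for it.

```cpp
#include <charconv>
#include <string>
#include <string_view>
#include <variant>

// Analogue of the token value variant: initially a raw character range
// (string_view here), lazily replaced by a concrete attribute type.
using token_value = std::variant<std::string_view, std::string, int, double>;

int as_int(token_value& v)
{
    if (auto* sv = std::get_if<std::string_view>(&v)) {
        int out = 0;
        std::from_chars(sv->data(), sv->data() + sv->size(), out);
        v = out;  // the lazy coercion happens here, on first use
    }
    return std::get<int>(v);
}
```

Until `as_int` is called, the variant still holds the iterator range; that is why the "funky" iterators only bite when a rule actually consumes the value.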

utree is a very versatile container [1]. Hence, when you expose a spirit::utree attribute on a rule and the token value variant contains an iterator_range, Qi attempts to assign that range into the utree object (this fails because the iterators are ... 'funky').

The easiest way to get your desired behaviour is to force Qi to interpret the attribute of the tag token as a string, and have that assigned to the utree. Therefore the following line constitutes a fix that will make compilation succeed:

    unknowntagvalue = qi::as_string[tok.tag] >> restofline;

Notes

Having said all this, I would indeed suggest the following:

  • Consider using the Nabialek Trick to dispatch different lazy rules depending on the tag matched - this makes it unnecessary to deal with raw utrees later on

  • You might have success specializing boost::spirit::traits::assign_to_XXXXXX traits (see the documentation)

  • Consider using a pure Qi parser. While I can "feel" your sentiment that it is going to be brittle [2], it seems you have already demonstrated that tokenising raises the complexity to such a degree that it might have no net merit:

    • the unexpected ways in which attributes materialize (this question)
    • the problem with line-position iterators (this is a frequently asked question, and as far as I remember it mostly has hard or inelegant solutions)
    • the inflexibility regarding e.g. ad-hoc debugging (access to the source data in semantic actions), switching/disabling skippers, etc.
    • my personal experience is that using lexer states to drive these things isn't helpful, because switching lexer state only works reliably from lexer-token semantic actions, whereas the disambiguation would often need to happen in the Qi phase
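To illustrate the first bullet: stripped of Spirit, the Nabialek Trick is essentially a symbol table that maps each recognised tag to the rule used to parse its value, so raw token soups never need to be stored. A plain-C++ caricature of that dispatch (the `handler`/`dispatch` names are mine, and real Spirit code would use qi::symbols with qi::lazy instead of std::map and std::function):

```cpp
#include <functional>
#include <map>
#include <string>

// A "rule" in this caricature is just a handler for the value text.
using handler = std::function<std::string(const std::string&)>;

// Look the tag up in the symbol table and invoke the matching handler,
// falling back to a catch-all for unrecognised tags.
std::string dispatch(const std::map<std::string, handler>& table,
                     const std::string& tag, const std::string& value)
{
    auto it = table.find(tag);
    if (it == table.end())
        return "unknown tag";  // fall back to the freeform rule
    return it->second(value);
}
```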

but I'm diverging :)


[1] e.g. they have facilities for very lightweight 'referencing' of iterator ranges (e.g. for symbols, or to avoid copying characters from a source buffer into the attribute unless wanted)

[2] In effect, it is brittle only in the sense that a sequential lexer (scanner) vastly reduces the number of backtracking opportunities and so simplifies the mental model of the parser. However, you can use expectation points to much the same effect.
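A toy illustration of what an expectation point buys you, without any Spirit (the `parse_event_line` function is hypothetical; in Qi you would write `"event:" > expression` with the `>` operator): once the committed prefix has matched, failure becomes a hard error instead of a silent backtrack into some other alternative.

```cpp
#include <stdexcept>
#include <string>

// After "event:" has matched, the rest of the line *must* follow;
// a missing expression is a hard error, not a reason to backtrack.
bool parse_event_line(const std::string& line)
{
    const std::string key = "event:";
    if (line.compare(0, key.size(), key) != 0)
        return false;  // ordinary soft failure: let other rules try
    if (line.size() <= key.size())
        throw std::runtime_error("expected expression after 'event:'");
    return true;
}
```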



Source: https://stackoverflow.com/questions/16196065/parsing-a-tokenized-free-form-grammar-with-boost-spirit
