Does the Peg.js engine backstep after a lookahead like regexes do?

邮差的信 submitted on 2019-12-25 00:46:58

Question


According to regular-expressions.info on lookarounds, the engine backsteps after a lookahead:

Let's take one more look inside, to make sure you understand the implications of the lookahead. Let's apply q(?=u)i to quit. The lookahead is now positive and is followed by another token. Again, q matches q and u matches u. Again, the match from the lookahead must be discarded, so the engine steps back from i in the string to u. The lookahead was successful, so the engine continues with i. But i cannot match u. So this match attempt fails. All remaining attempts fail as well, because there are no more q's in the string.
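The behavior described in that quote can be checked with any regex engine. Here is a minimal sketch using Python's `re` module (my choice for illustration; the quoted article is not specific to Python):

```python
import re

# q(?=u)i: the lookahead sees "u" but consumes nothing,
# so the engine then tries to match "i" against that same "u" and fails.
no_match = re.search(r"q(?=u)i", "quit")

# q(?=u)ui: after the lookahead succeeds, "u" is matched for real, then "i".
match = re.search(r"q(?=u)ui", "quit")

print(no_match)        # None
print(match.group(0))  # qui
```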

However, in Peg.js it SEEMS like the engine still moves past the & or !, so that it isn't a lookahead in the same sense as in regexps but a decision on consumption; there is no backstepping, and therefore no true looking ahead.

Is this the case?

(If so, then certain parses aren't even possible, like this one?)


Answer 1:


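Lookahead works similarly to how it does in a regex engine: the & and ! predicates consume no input, so parsing resumes at the position where the lookahead started.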

This query fails to match "quit" because, after the lookahead, the next letter to match is still 'u', not 'i'.

word = 'q' &'u' 'i' 't'

This query succeeds:

word = 'q' &'u' 'u' 'i' 't'

This query succeeds:

word = 'q' 'u' 'i' 't'
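To make the three results above concrete, here is a minimal hand-written sketch of PEG-style matching in Python. It is not Peg.js itself: the helper names (`lit`, `seq`, `and_pred`, `matches`) are hypothetical, and the only thing it tries to show is that an & predicate returns the position it started from.

```python
from typing import Callable, Optional

# A parser takes (text, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]

def lit(s: str) -> Parser:
    """Match the literal s and consume it."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers: Parser) -> Parser:
    """Match each parser in order, threading the position through."""
    def parse(text: str, pos: int) -> Optional[int]:
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def and_pred(p: Parser) -> Parser:
    """&p: succeed only if p matches here, but hand back the ORIGINAL position."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos if p(text, pos) is not None else None
    return parse

def matches(rule: Parser, text: str) -> bool:
    """Assumption: like a Peg.js parse, the rule must consume the whole input."""
    return rule(text, 0) == len(text)

# word = 'q' &'u' 'i' 't'
fails = matches(seq(lit("q"), and_pred(lit("u")), lit("i"), lit("t")), "quit")
# word = 'q' &'u' 'u' 'i' 't'
with_u = matches(seq(lit("q"), and_pred(lit("u")), lit("u"), lit("i"), lit("t")), "quit")
# word = 'q' 'u' 'i' 't'
plain = matches(seq(lit("q"), lit("u"), lit("i"), lit("t")), "quit")

print(fails, with_u, plain)  # False True True
```

In the first rule the predicate succeeds at position 1 and returns position 1, so 'i' is tried against "u" and the match fails, just as in the regex case.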

As for your example, try something along these lines; you shouldn't need to use lookaheads at all:

expression
    = termPair ( _ delimiter _ termPair )*

termPair
    = term ('.' term)? ' ' term ('.' term)?

term "term"
    = $([a-z0-9]+)

delimiter "delimiter"
    = "."

_ "whitespace"
    = [ \t\n\r]+

EDIT: Added another example per comments below.

expression
    = first:term rest:delimTerm* { return [first].concat(rest); }

delimTerm
    = delimiter t:term { return t; }

term "term"
    = $((!delimiter [a-z0-9. ])+)

delimiter "delimiter"
    = _ "." _

_ "whitespace"
    = [ \t\n\r]+

EDIT: Added extra explanation of the term expression.

I'll try to break down the term rule a bit: $((!delimiter [a-z0-9. ])+).

$() returns everything matched inside it as a single string, much like [].join('') would.

A single "character" of a term is any character matching [a-z0-9. ]; if we wanted to simplify it, we could say . (any character) instead. Before matching each character we look ahead for a delimiter; if we find one, the term stops there without consuming it. Since we want multiple characters, we repeat the whole thing with +.

I think it's a common idiom in PEG parsers to move forward this way. I learned the idea from the treetop documentation for matching a string.
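The "check for a delimiter, then take one character, and repeat" idiom can be sketched by hand as follows. This is an illustrative Python translation of the second grammar, not code generated by Peg.js; the function names are hypothetical and the sample input is my own.

```python
import string

WHITESPACE = " \t\n\r"
TERM_CHARS = string.ascii_lowercase + string.digits + ". "

def skip_ws(text, pos):
    """_ = [ \\t\\n\\r]+ : position after one or more whitespace chars, else None."""
    end = pos
    while end < len(text) and text[end] in WHITESPACE:
        end += 1
    return end if end > pos else None

def delimiter(text, pos):
    """delimiter = _ "." _ : position after the delimiter, else None."""
    pos = skip_ws(text, pos)
    if pos is None or not text.startswith(".", pos):
        return None
    return skip_ws(text, pos + 1)

def term(text, pos):
    """term = $((!delimiter [a-z0-9. ])+) : (matched text, new position) or None."""
    start = pos
    while pos < len(text):
        if delimiter(text, pos) is not None:  # !delimiter fails: stop, consume nothing
            break
        if text[pos] not in TERM_CHARS:
            break
        pos += 1                              # consume exactly one character
    return (text[start:pos], pos) if pos > start else None

def expression(text):
    """expression = first:term rest:(delimiter term)* ; must consume all input."""
    result = term(text, 0)
    if result is None:
        return None
    first, pos = result
    terms = [first]
    while True:
        after = delimiter(text, pos)
        nxt = term(text, after) if after is not None else None
        if nxt is None:
            break                             # backtrack: pos stays before the delimiter
        value, pos = nxt
        terms.append(value)
    return terms if pos == len(text) else None

print(expression("a.b c . d e . f"))  # ['a.b c', 'd e', 'f']
```

Note that `delimiter` is only used as a test inside `term`: its result position is discarded there, so a "." without surrounding whitespace stays part of the term while " . " ends it.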



Source: https://stackoverflow.com/questions/52894116/does-the-peg-js-engine-backstep-after-a-lookahead-like-regexs-do
