ANTLR4 lexer not resolving ambiguity in grammar order

Submitted by 烂漫一生 on 2020-01-15 04:50:14

Question


Using ANTLR 4.2, I'm trying a very simple parse of this test data:

RRV0#ABC

Using a minimal grammar:

grammar Tiny;

thing : RRV N HASH ID ;

RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [ \t\r\n]+ -> skip ; // match 1-or-more whitespace but discard

I expect the lexer rule RRV to take precedence over ID, based on the excerpt below from Terence Parr's The Definitive ANTLR 4 Reference:

BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter

Running the ANTLR4 test rig with the test data above, the output is

[@0,0:3='RRV0',<4>,1:0]
[@1,4:4='#',<3>,1:4]
[@2,5:7='ABC',<4>,1:5]
[@3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'

I can see that the first token is <4>, which is ID, with the value 'RRV0'.

I have tried rearranging the order of the lexer rules. I have also tried using implicit lexer tokens by matching the literal directly in the parser rule (rather than through an explicit lexer rule). I tried making the matches non-greedy too. None of these worked.

If I change the ID lexer rule so that it does not match upper-case letters, then the RRV rule does match and the parse gets further.

I started in ANTLR 4.1 with the same issue.

I checked in ANTLRWorks and from the command line, with the same result both ways.

How can I change the grammar so that the lexer rule RRV is matched in preference to ID?


Answer 1:


The grammar order resolution policy only applies when two different lexer rules match a token of the same length. When the lengths differ, the longest match always wins. In your case, the ID rule matches RRV0, a token of length 4, which is longer than the 3 characters that the RRV rule can match.
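
The policy described above can be sketched in a few lines of Python. This is only an illustrative simulation of "longest match wins, grammar order breaks ties", not how ANTLR is implemented; the RULES table and the tokenize function are hypothetical names, and the rules are assumed to mirror the Tiny grammar from the question.

```python
import re

# Lexer rules in grammar order (assumed to mirror the Tiny grammar above).
RULES = [
    ("RRV", re.compile(r"RRV")),
    ("N", re.compile(r"[0-9]+")),
    ("HASH", re.compile(r"#")),
    ("ID", re.compile(r"[a-zA-Z0-9]+")),
]

def tokenize(text):
    """Longest match wins; when lengths tie, the earlier rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("RRV0#ABC"))  # ID wins the first token: 4 characters beat 3
print(tokenize("RRV#ABC"))   # RRV and ID tie at 3 characters, so RRV wins
```

With the input RRV0#ABC the first token comes out as ID, matching the test rig output in the question. RRV only wins when the text that follows it cannot extend an ID match.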

This strategy is especially important in languages like Java. Consider the following input:

String className = "";

Along with the following two grammar rules (slightly simplified):

CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;

If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.
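
To see what a "grammar order only" lexer would do, here is a small Python sketch. It relies on the fact that alternation in Python's re module is ordered (the first alternative that matches is taken), which imitates such a hypothetical lexer. The function name first_match_tokens and the two pattern variables are made up for illustration.

```python
import re

ID_PATTERN = r"(?P<ID>[a-zA-Z_][a-zA-Z0-9_]*)"
CLASS_PATTERN = r"(?P<CLASS>class)"

# Ordered alternation: the left alternative is tried first at each position.
CLASS_FIRST = re.compile(CLASS_PATTERN + "|" + ID_PATTERN)
ID_FIRST = re.compile(ID_PATTERN + "|" + CLASS_PATTERN)

def first_match_tokens(pattern, text):
    """Tokenize using rule order alone, ignoring match length."""
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]

# CLASS listed first: the identifier className is split in two.
print(first_match_tokens(CLASS_FIRST, "className"))
# ID listed first: the keyword class can never be produced.
print(first_match_tokens(ID_FIRST, "class"))
```

Neither ordering gives the desired result, which is why the lexer prefers the longest match and uses rule order only to break ties.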



Source: https://stackoverflow.com/questions/21579350/antlr4-lexer-not-resolving-ambiguity-in-grammar-order
