Question
Consider this very simplified example, where input of the following form should be matched:
mykey -> This is the value
My real case is much more complex, but this will do to show what I am trying to achieve. mykey is an ID, while on the right side of -> we have a sequence of Words. If I use
grammar Root;
parse
: ID '->' value
;
value
: Word+
;
ID
: ('a'..'z')+
;
Word
: ('a'..'z' | 'A'..'Z' | '0'..'9')+
;
WS
: ' ' -> skip
;
the example won't be parsed, because the lexer emits an ID token for the word is: both ID and Word match it with the same length, and ID is defined first. An ID token is not matched by Word+. In my real example, the value language is vastly different and I'd like to parse it with a different grammar.
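The tokenization problem can be reproduced without ANTLR. The following is a toy model of the rule selection described above (longest match wins, ties go to the rule defined first), not ANTLR itself. The ARROW name for the implicit '->' token and the tokenize helper are assumptions made for the illustration.

```python
import re

# Toy model of the combined grammar's lexer rules, in definition order.
# Illustration only: ARROW stands in for the implicit '->' literal token.
RULES = [
    ("ID", re.compile(r"[a-z]+")),
    ("Word", re.compile(r"[a-zA-Z0-9]+")),
    ("ARROW", re.compile(r"->")),
    ("WS", re.compile(r" ")),
]


def tokenize(text):
    """Return (type, text) pairs: longest match wins, ties go to the earlier rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, m = best
        if name != "WS":  # WS is skipped, as in the grammar
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens


tokens = tokenize("mykey -> This is the value")
for token in tokens:
    print(token)
```

Only This comes out as a Word token, because of its capital letter. The words is, the and value are all typed as ID before the parser ever sees them.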
I have considered different solutions:
1. Switching the lexer mode. But AFAIK, switching the lexer to a different mode can only happen in a lexer rule. This is problematic for this case, and for my real case as well, as there are no unique tokens that start and end the value part. What I would need is something like "tokenize value with different rules", which is, of course, stupid, because lexer and parser act independently and as soon as the parser starts, everything is already tokenized.
2. Using a different grammar for value. If I see this right, the approach of importing a grammar won't work, since it always combines the two grammars, leading to the same situation of wrong tokenization.
3. Creating a first crude parser that accepts the whole language but doesn't create the correct tree for value. I could then use a visitor and reparse value nodes with a different sub-parser, possibly inserting a new, correct subtree for value. This feels a bit clumsy.
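The third option, a crude first pass followed by reparsing, can be sketched roughly as follows. This is a toy model with plain tuples as tree nodes and regular expressions instead of generated parsers. The function names parse_outer, parse_value and reparse are hypothetical, and with ANTLR the second pass would instead run a separate lexer and parser over the text of each value node.

```python
import re


def parse_value(raw):
    """Hypothetical sub-parser for the injected language: here just a word list."""
    words = raw.split()
    for w in words:
        if not re.fullmatch(r"[a-zA-Z0-9]+", w):
            raise ValueError(f"not a word: {w!r}")
    return ("value", words)


def parse_outer(line):
    """Crude first pass: recognise the key, keep the value as opaque text."""
    m = re.fullmatch(r"\s*([a-z]+)\s*->\s*(.*)", line)
    if m is None:
        raise ValueError("expected: key -> value")
    return ("parse", m.group(1), ("raw_value", m.group(2)))


def reparse(tree):
    """Second pass: replace the opaque raw_value node with the sub-parser's tree."""
    kind, key, (_, raw) = tree
    return (kind, key, parse_value(raw))


tree = reparse(parse_outer("mykey -> This is the value"))
print(tree)
```

The outer pass never tokenizes the value, so the sub-parser is free to apply its own rules to it.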
If you need a simple real-world application, then you could consider strings in Java. Some of them might be a regex which needs to be parsed with a completely different parser. It is similar to injected languages you can use inside IDEA.
Question: Is there an idiomatic way in ANTLR4 to parse a specific rule with a different grammar? Ideally I could specify this at the grammar level, so that the resulting AST is a combination of the outer language and a subtree of the injected language.
Answer 1:
Actually, using modes is the idiomatic solution. It just requires being a bit creative in identifying the mode guards. Here the arrow enters the value mode and the end of the line leaves it:
parser grammar RootParser ;
options {
tokenVocab = RootLexer ;
}
parse : ID RARROW value EOF ;
value : WORD+ ;
and
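lexer grammar RootLexer ;
ID : [a-z]+ ;
RARROW : '->' -> pushMode(value) ;
WS : ' ' -> skip ; // whitespace before the arrow, in the default mode
mode value ;
EOL : [\r\n]+ -> popMode, skip ;
WORD : [a-zA-Z0-9]+ ;
VALUE_WS : ' ' -> skip ; // rule names must be unique across modes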
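To see how the mode guards change tokenization, here is a toy model of a modal lexer with a mode stack. It is an illustration in plain Python, not ANTLR. The MODES table and the action strings are hypothetical, and the model assumes a whitespace-skipping rule in the default mode so that the sample input lexes cleanly.

```python
import re

# Toy model of a modal lexer: each mode has its own ordered rule list.
# Entries are (token name, pattern, action). Action strings are made up
# for this sketch: "skip", "pop", "push:<mode>" or None.
MODES = {
    "default": [
        ("ID", r"[a-z]+", None),
        ("RARROW", r"->", "push:value"),
        ("WS", r" ", "skip"),  # assumed: skip spaces before the arrow
    ],
    "value": [
        ("EOL", r"[\r\n]+", "pop"),  # popMode, skip
        ("WORD", r"[a-zA-Z0-9]+", None),
        ("VALUE_WS", r" ", "skip"),
    ],
}


def tokenize(text):
    """Tokenize with the rules of whichever mode is on top of the stack."""
    stack, pos, tokens = ["default"], 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, m, action = best
        pos = m.end()
        if action == "skip":
            continue
        if action == "pop":
            stack.pop()  # leave the value mode, emit nothing
            continue
        tokens.append((name, m.group()))
        if action is not None and action.startswith("push:"):
            stack.append(action.split(":", 1)[1])
    return tokens


tokens = tokenize("mykey -> This is the value\nother -> A b\n")
for token in tokens:
    print(token)
```

After the arrow, is, the and value come out as WORD rather than ID, because the ID rule does not exist in the value mode. The newline pops back to the default mode, so other on the second line is an ID again.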
Answer 2:
You can try to move the decision about what counts as a word into the parser:
grammar Root;
parse
: ID '->' value
;
value
: word+
;
word : Word | ID;
//the same lexer rules as above
This will parse the value as follows (text -> token -> parser rule):
This -> Word -> word
is -> ID -> word
the -> ID -> word
value -> ID -> word
So at the level of the parser nodes you will have only words.
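The effect of the word rule can be sketched with a small hand-written parser over an already tokenized input. This is a toy model under assumptions: the token list is written out by hand to match what the lexer rules above would produce, ARROW stands in for the '->' literal, and the tuple-based tree is not what ANTLR generates.

```python
def parse(tokens):
    """Sketch of: parse : ID '->' value ; value : word+ ; word : Word | ID ;"""
    if len(tokens) < 3 or tokens[0][0] != "ID" or tokens[1][0] != "ARROW":
        raise ValueError("expected: ID '->' word+")
    words = []
    for kind, text in tokens[2:]:
        # The parser rule accepts either token type, so the lexer's
        # ID/Word distinction disappears at the tree level.
        if kind not in ("Word", "ID"):
            raise ValueError(f"unexpected token {kind} ({text!r})")
        words.append(("word", text))
    return ("parse", tokens[0][1], ("value", words))


# Token stream as the lexer would type it: only "This" is a Word.
tokens = [
    ("ID", "mykey"),
    ("ARROW", "->"),
    ("Word", "This"),
    ("ID", "is"),
    ("ID", "the"),
    ("ID", "value"),
]
tree = parse(tokens)
print(tree)
```

This keeps a single lexer, which is simple, but it only works while the value language can be expressed with the outer language's tokens.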
Source: https://stackoverflow.com/questions/47108761/antlr4-invoke-different-sub-parser-for-specific-rule