ANTLR grammar to recognize DIGIT keys and INTEGERS too

问题

I'm trying to create an ANTLR grammar to parse sequences of keys that optionally have a repeat count. For example, (a b c r5) means "repeat keys a, b, and c five times."

I have the grammar working for KEYS : ('a'..'z'|'A'..'Z').

But when I try to add digit keys KEYS : ('a'..'z'|'A'..'Z'|'0'..'9') with an input expression like (a 5 r5), the parse fails on the middle 5 because it can't tell if the 5 is an INTEGER or a KEY. (Or so I think; the error messages are difficult to interpret "NoViableAltException").

I have tried these grammatical forms, which work ('r' means "repeatcount"):

repeat : '(' LETTERKEYS INTEGER ')' - works for a-zA-Z
repeat : '(' LETTERKEYS 'r' INTEGER ')'; - works for a-zA-Z

But I fail with

repeat : '(' LETTERSandDIGITKEYS INTEGER ')' - fails on '(a 5 r5)'
repeat : '(' LETTERSandDIGITKEYS 'r' INTEGER ')'; - fails on '(a 5 r5)'

Maybe the grammar can't do the recognition; maybe I need to recognize all the 5's keys in the same way (as KEYS or DIGITS or INTEGERS) and in the parse tree visitor interpret the middle DIGIT instances as keys, and the last set of DIGITS as an INTEGER count?

Is it possible to define a grammar that allows me to repeat digit keys as well as letter keys so that expressions like (a 5 123 r5) will be recognized correctly? (That is, "repeat keys a,5,1,2,3 five times.") I'm not tied to that specific syntax, although it would be nice to use something similar.

Thank you.

回答1:

the parse fails on the middle 5 because it can't tell if the 5 is an INTEGER or a KEY.

If you have defined the following rules:

INTEGER : [0-9]+;
KEY     : [a-zA-Z0-9];

then a single digit, like 5 in your example, will always become an INTEGER token. Even if the parser is trying to match a KEY token, the 5 will become an INTEGER. There is nothing you can do about that: this is the way ANTLR's lexer works. The lexer works in the following way:

try to consume as many characters as possible (the longest match wins)
if 2 or more rules match the same characters (like INTEGER and KEY in case of 5), let the rule defined first "win"

If you want a 5 to be an INTEGER, but sometimes a KEY, do something like this instead:

key     : KEY | SINGLE_DIGIT | R;
integer : INTEGER | SINGLE_DIGIT;
repeat  : R integer;

SINGLE_DIGIT : [0-9];
INTEGER      : [0-9]+;
R            : 'r';
KEY          : [a-zA-Z];

and in your parser rules, you use key and integer instead of KEY and INTEGER.

回答2:

You can split your grammar into two parts. One to be the lexer grammar, one to be the parser grammar. The lexer grammar splits the input characters into tokens. The parser grammar uses the string of tokens to parse and build a syntax tree. I work on Tunnel Grammar Studio (TGS) that can generate parsers with this two ABNF (RFC 5234) like grammars:

key         = 'a'-'z' / 'A'-'Z' / '0'-'9'
repeater    = 'r' 1*('0'-'9')

That is the lexer grammar that has two rules. Each character that is not processed by the lexer grammar, is converted to a token, made from the character itself. Meaning that a is a key, r11 is a repeater and ( for example will be transferred to the parser as a token (.

document    = *ws repeat
repeat      = '(' *ws *({key} *ws) [{repeater} *ws] ')' *ws
ws          = ' ' / %x9 / %xA / %xD

This is the parser grammar, that has 3 rules. The document rule accepts (recognizes) white space at first, then one repeat rule. The repeat rule starts with a scope, followed by any number of white space. After that is a list of keys maybe separated by white space and after all keys there is an optional repeater token followed by optional white space, closing scope and again optional white space. The white space is space tab carriage return and line feed in that order.

The runtime of this parsing is linear for both the lexer and the parser because both grammars are LL(1). Bellow is the generic parse tree from the TGS online laboratory, where you can run this grammars for input (a 5 r5) and you will get this tree:

If you want to have more complex key, then you may use this:

key = 1*('a'-'z' / 'A'-'Z' / '0'-'9')

In this case however, the key and repeater lexer rules will both recognize the sequence r7, but because the repeater rule is defined later it will take precedence (i.e. overwrites the previous rule). With other words r7 will be a repeater token, and the parsing will still be linear. You will get a warning from TGS if your lexer rules overwrite one another.

来源：https://stackoverflow.com/questions/64548645/antlr-grammar-to-recognize-digit-keys-and-integers-too

标签

parsing

antlr4