Question
How is ANTLR lexer behavior defined in the case of conflicting tokens? Let me explain what I mean by "conflicting" tokens. For example, assume that the following is defined:
INT_STAGE : '1'..'6';
INT : '0'..'9'+;
There is a conflict here because, after reading a sequence of digits, the lexer cannot know whether it has seen one INT or several INT_STAGE tokens (or some combination of both). From a test, it looks as though the lexer prefers INT_STAGE when INT is defined after INT_STAGE, but does that mean INT might then not be found? Otherwise, no INT_STAGE would ever be found.
Another example would be:
FOOL: 'fool';
FOO: 'foo';
ID : ('a'..'z'|'A'..'Z'|'_'|'%') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'%')*;
I was told that this is the "right" order to recognize all the tokens: while reading "fool" the lexer will find one FOOL token and not FOO ID or something else.
Answer 1:
The following logic applies:
- the lexer matches as many characters as possible
- if, after applying rule 1, two or more rules match the same number of characters, the rule defined first will "win"
Taking this into account, the input "1", "2", ..., "6" is tokenized as an INT_STAGE: both INT_STAGE and INT match the same number of characters, but INT_STAGE is defined first.
The input "12" is tokenized as an INT since that rule matches the most characters.
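The two rules can be sketched as a small simulation. This is not ANTLR itself, only a minimal Python model of "longest match, then first-defined rule"; the names RULES, next_token and tokenize are hypothetical, and the regular expressions are assumed to be equivalent to the INT_STAGE and INT rules from the question.

```python
import re

# Rules in definition order, mirroring the grammar from the question.
# (Assumption: these regexes are equivalent to the two lexer rules.)
RULES = [
    ("INT_STAGE", re.compile(r"[1-6]")),
    ("INT", re.compile(r"[0-9]+")),
]

def next_token(text, pos, rules=RULES):
    """Return (rule_name, lexeme) for the longest match at pos.

    Ties are resolved in favour of the rule listed first.
    """
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so on a tie the earlier rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def tokenize(text, rules=RULES):
    """Split text into (rule_name, lexeme) pairs using next_token."""
    pos, tokens = 0, []
    while pos < len(text):
        tok = next_token(text, pos, rules)
        if tok is None or tok[1] == "":
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens

print(tokenize("1"))   # [('INT_STAGE', '1')]
print(tokenize("12"))  # [('INT', '12')]
print(tokenize("7"))   # [('INT', '7')]
```

In this model, INT_STAGE is only ever produced for a single digit from 1 to 6 that is not followed by another digit; any longer run of digits goes to INT.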
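As for the second example, where you were told that this is the "right" order and that the lexer will find one FOOL token, and not FOO ID or something else, when reading "fool": that is correct. Both FOOL and ID match all four characters, more than the three matched by FOO, and FOOL wins the tie because it is defined before ID.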
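To see why the order of the keyword rules relative to ID matters, here is a hedged Python sketch that applies the same "longest match, then first-defined rule" logic to the FOOL, FOO and ID rules. It only models the behaviour described in the answer; first_token, KEYWORDS_FIRST and ID_FIRST are hypothetical names, and the regexes are assumed translations of the grammar rules.

```python
import re

# Stand-ins for the lexer rules, in the order given in the question.
KEYWORDS_FIRST = [
    ("FOOL", r"fool"),
    ("FOO", r"foo"),
    ("ID", r"[a-zA-Z_%][a-zA-Z0-9_%]*"),
]
# The same rules with ID moved to the front, for comparison.
ID_FIRST = [KEYWORDS_FIRST[2], KEYWORDS_FIRST[0], KEYWORDS_FIRST[1]]

def first_token(text, rules):
    """Return (rule_name, lexeme) for the first token of text.

    The longest match wins; on a tie, the rule listed first wins.
    """
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strictly greater: an equally long later rule never replaces
        # an earlier one.
        if m and m.end() > best_len:
            best_name, best_len = name, m.end()
    return best_name, text[:best_len]

for word in ("fool", "foo", "fools", "food"):
    print(word, first_token(word, KEYWORDS_FIRST), first_token(word, ID_FIRST))
```

With the keywords listed first, "fool" and "foo" come out as FOOL and FOO, while "fools" and "food" become ID because ID matches more characters. With ID listed first, every one of these inputs becomes ID, since ID always matches at least as many characters as the keyword rules.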
Source: https://stackoverflow.com/questions/34592107/antlr-behaviour-with-conflicting-tokens