Question
I want to read an input stream and divide the input into 2 types: PATTERN & WORD_WEIGHT, which are defined below.
The problem arises from the fact that all the chars defined for a WORD_WEIGHT are also valid for a PATTERN. When we have multiple WORD_WEIGHTs without spaces between them, the lexer matches a single PATTERN rather than delivering multiple WORD_WEIGHTs.
I need to be able to handle the following cases and get the indicated result:
- [20] => WORD_WEIGHT
- cat => PATTERN
- [dog] => PATTERN
This next case is the problem: it matches PATTERN because the lexer selects the longer of the 2 possible matches. Note: there's no space between the two word weights.
- [20][30] => WORD_WEIGHT WORD_WEIGHT
Also need to handle this case (which imposes some limits on the possible solutions). Note that the brackets may not be matching for a PATTERN...
- [[[cat] => PATTERN
Here's the grammar:
grammar Brackets;
fragment
DIGIT
: ('0'..'9')
;
fragment
WORD_WEIGHT_VALUE
: ('-' | '+')? DIGIT+ ('.' DIGIT+)?
| ('-' | '+')? '.' DIGIT+
;
WORD_WEIGHT
: '[' WORD_WEIGHT_VALUE ']'
;
PATTERN
: ~(' ' | '\t' | '\r' | '\n' )+
;
WS
: (' ' | '\t' | '\r' | '\n' )+ -> skip
;
start : (PATTERN | WORD_WEIGHT)* EOF;
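The longest-match behavior behind the problem case can be reproduced outside ANTLR. The following is only a sketch: it approximates the two lexer rules with java.util.regex patterns and picks the longer match, breaking ties in favor of the rule listed first. The class and method names (LongestMatchDemo, matchLength, longestMatchRule) are made up for illustration and are not part of ANTLR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Regex approximations of the two lexer rules (an assumption, not ANTLR itself).
    static final Pattern WORD_WEIGHT =
        Pattern.compile("\\[(?:[-+]?\\d+(?:\\.\\d+)?|[-+]?\\.\\d+)\\]");
    static final Pattern PATTERN = Pattern.compile("[^ \\t\\r\\n]+");

    // Length of the match of p starting exactly at pos, or -1 if there is none.
    static int matchLength(Pattern p, String input, int pos) {
        Matcher m = p.matcher(input);
        m.region(pos, input.length());
        return m.lookingAt() ? m.end() - pos : -1;
    }

    // Choose the rule with the longest match; a tie goes to WORD_WEIGHT,
    // which is listed first in the grammar.
    static String longestMatchRule(String input, int pos) {
        int ww = matchLength(WORD_WEIGHT, input, pos);
        int pat = matchLength(PATTERN, input, pos);
        if (ww < 0 && pat < 0) {
            return "NONE";
        }
        return ww >= pat ? "WORD_WEIGHT" : "PATTERN";
    }

    public static void main(String[] args) {
        String[] inputs = {"[20]", "cat", "[dog]", "[20][30]", "[[[cat]"};
        for (String s : inputs) {
            System.out.println(s + " => " + longestMatchRule(s, 0));
        }
    }
}
```

For [20][30] the WORD_WEIGHT approximation matches 4 characters and the PATTERN approximation matches 8, so the sketch reports PATTERN, which mirrors the unwanted result described above.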
The question is, what Lexer rules would give the desired result?
I'm wishing for a feature, a special directive that one can specify for a lexer rule that affects the matching process. It would instruct the lexer, upon a match of the rule, to stop the matching process and use this matched token.
FOLLOW-UP - THE SOLUTION WE CHOSE TO PURSUE:
Replace WORD_WEIGHT above with:
fragment
WORD_WEIGHT
: '[' WORD_WEIGHT_VALUE ']'
;
WORD_WEIGHTS
: WORD_WEIGHT (INNER_WS? WORD_WEIGHT)*
;
fragment
INNER_WS
: (' ' | '\t' )+
;
Also, the Grammar rule becomes:
start : (PATTERN | WORD_WEIGHTS)* EOF;
Now any sequence of word weights (space separated or not) becomes the value of a single WORD_WEIGHTS token. This happens to match our usage too: our grammar (not in the snippet above) always defines word weights as "one or more". The multiplicity is now "captured" by the lexer instead of the parser. If/when we need to process each word weight separately, we can split the value in the application (parse tree listener).
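Splitting the token value in the application could look roughly like the sketch below. It assumes the text of a WORD_WEIGHTS token has already been obtained as a String (for example from the token's getText()), and it re-scans that text with a regex that mirrors WORD_WEIGHT_VALUE. The class name WordWeightSplitter and its split method are hypothetical helpers, not generated by ANTLR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordWeightSplitter {
    // One bracketed weight; group 1 captures the numeric value.
    // Mirrors WORD_WEIGHT_VALUE: optional sign, then digits with an optional
    // fraction, or a bare fraction such as ".5".
    private static final Pattern ONE_WEIGHT =
        Pattern.compile("\\[((?:[-+]?\\d+(?:\\.\\d+)?)|(?:[-+]?\\.\\d+))\\]");

    // Returns the weights found in the token text, in order of appearance.
    // Spaces or tabs between the bracketed values are simply skipped.
    static List<Double> split(String tokenText) {
        List<Double> weights = new ArrayList<>();
        Matcher m = ONE_WEIGHT.matcher(tokenText);
        while (m.find()) {
            weights.add(Double.parseDouble(m.group(1)));
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(split("[20][30]"));
        System.out.println(split("[+1.5] \t[-.25]"));
    }
}
```

Because the lexer has already validated the token, the regex here only needs to locate each bracketed value; it does not have to reject malformed input.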
Answer 1:
You can implement WORD_WEIGHT as follows:
WORD_WEIGHT
: '[' WORD_WEIGHT_VALUE ']'
PATTERN?
;
Then, in your lexer, you can override the emit method to correct the position of the lexer to remove the PATTERN (if any) which was added to the end of the WORD_WEIGHT token. You can see examples of this in ANTLRWorks 2:
- The LBRACE token in StringTemplate 4 is modified by this code.
- The DELIMITERS token in StringTemplate 4 is modified by this code.
The modification requires the following steps.
- Override LexerATNSimulator to add the resetAcceptPosition method.
- Set the _interp field to an instance of your custom LexerATNSimulator in the constructor for your lexer class.
- Calculate the desired end position for your token, and call resetAcceptPosition. For fixed-width tokens like you see in the ST4 examples, the calculation was simply the length of the fixed operator or keyword which appeared at the beginning of the token. For your case, you will need to call getText() and examine the result to determine the correct length of your WORD_WEIGHT token. Since the WORD_WEIGHT_VALUE rule cannot match ], the easiest analysis would probably be to find the index of the first ] character in the result of getText() (the syntax of WORD_WEIGHT ensures the character will always exist).
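The length calculation from the last step can be sketched in plain Java. This covers only the arithmetic on the matched text. Wiring it into an overridden emit method and the custom resetAcceptPosition method is not shown, since that depends on the simulator subclass you write. The class name WordWeightBoundary and its method are hypothetical.

```java
class WordWeightBoundary {
    // Given the text matched by the extended rule (a WORD_WEIGHT followed by
    // an optional PATTERN), returns the length of the leading WORD_WEIGHT:
    // everything up to and including the first ']'.
    static int wordWeightLength(String matchedText) {
        int close = matchedText.indexOf(']');
        if (close < 0) {
            // Should not happen for text matched by WORD_WEIGHT, which always
            // contains a closing bracket.
            throw new IllegalArgumentException("no ']' in: " + matchedText);
        }
        return close + 1;
    }

    public static void main(String[] args) {
        String matched = "[20][30]";
        int length = wordWeightLength(matched);
        // The token should end after "[20]"; the remaining "[30]" is lexed again.
        System.out.println(matched.substring(0, length));
    }
}
```

The desired end position is then the token's start index plus this length, and the characters after it are handed back to the lexer for the next token.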
Source: https://stackoverflow.com/questions/23813487/lexer-overlapping-rule-but-want-the-shorter-match