问题
I'm writing a parser for a language that looks like the following:
L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>
That is, each line may or may not start with a line number of the form Lxxx.. ('L' followed by one or more digits) followed by an identifer or a keyword. Identifiers are standard [a-zA-Z_][a-zA-Z0-9_]* and the number of digits following the L is not fixed. Spaces between the line number and following identifer/keyword are optional (and not present in most cases).
My current lexer looks like:
// Parser rules
commands : command*;
command : LINE_NUM? keyword NEWLINE
| LINE_NUM? IDENTIFIER NEWLINE;
keyword : KEYWORD_A | KEYWORD_B | ... ;
// Lexer rules
fragment INT : [0-9]+;
LINE_NUM : 'L' INT;
KEYWORD_A : 'someKeyword';
KEYWORD_B : 'reservedWord';
...
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]*
However this results in all lines beginning with a LINE_NUM token to be tokenized as IDENTIFIERs.
Is there a way to properly tokenize this input using an ANTLR grammar?
回答1:
You need to add a semantic predicate to IDENTIFIER:
IDENTIFIER
: {_input.getCharPositionInLine() != 0
|| _input.LA(1) != 'L'
|| !Character.isDigit(_input.LA(2))}?
[a-zA-Z_] [a-zA-Z0-9_]*
;
You could also avoid semantic predicates by using lexer modes.
//
// Default mode is active at the beginning of a line
//
LINE_NUM
: 'L' [0-9]+ -> pushMode(NotBeginningOfLine)
;
KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
: ( 'L'
| 'L' [a-zA-Z_] [a-zA-Z0-9_]*
| [a-zA-KM-Z_] [a-zA-Z0-9_]*
)
-> pushMode(NotBeginningOfLine)
;
NL : ('\r' '\n'? | '\n');
mode NotBeginningOfLine;
NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
NotBeginningOfLine_IDENTIFIER
: [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER)
;
来源:https://stackoverflow.com/questions/22789057/lexer-to-handle-lines-with-line-number-prefix