问题
I am trying to use ANTLR4 to parse source files. One thing I need to do is that a string literal contains all kinds of characters and possibly white spaces while normal identifiers contains only English characters and digits (white spaces are thrown away).
I use the following antlr grammar rules (the minimal example), but it doesn't work as expected.
grammar parseString;
rules
: stringRule+
;
stringRule
: formatString
| idString
;
formatString
: STRING_DOUBLEQUOTE STRING STRING_DOUBLEQUOTE
;
idString
: (NONTERM | TERM)
;
// LEXER
STRING_DOUBLEQUOTE
: '"' ;
DIGITS
: DIGIT+
;
TERM
: UPPERCHAR CHAR+
;
NONTERM
: LOWERCHAR CHAR+
;
fragment
CHAR
: LOWERCHAR
| UPPERCHAR
| DIGIT
| '-'
| '_'
;
fragment
DIGIT
: [0-9]
;
fragment
LOWERCHAR
: [a-z]
;
fragment
UPPERCHAR
: [A-Z]
;
WS
: (' ' | '\t' | '\r' | '\n')+ -> skip
; // skip spaces, tabs, newlines
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
STRING
: ~('"')*
;
For the test cases that I use,
Test
HelloWorld
"$this is a string"
"*this is another string!"
I got the error line 1:0 extraneous input 'Test\nHelloWorld\n' expecting {'"', TERM, NONTERM}. And the last two lines of the 'formatString' are correctly parsed. But for the first two lines, since the newline characters ('\n') haven't got thrown away, thus they are not matched to 'idString'. I am wondering what I did wrong.
回答1:
Your STRING rule will match anything but quotes so will scarf just about anything. That is way too loose. You will need a much tighter definition of exactly what distinguishes a STRING from the others I think. Once it's in ~'"'* it will scarf until '"'.
回答2:
Yes there is a problem in this grammar. the token STRING matchs 'Test\nHelloWorld\n'. It will put everything in this token, but there is no rule that takes just the TOKEN STRING.
Think about changing the token STRING.
来源:https://stackoverflow.com/questions/23731646/antlr-parse-strings-keep-whitespaces-and-parse-normal-identifiers