ANTLR parse strings (keep whitespaces) and parse normal identifiers

Posted by 我是研究僧i on 2019-12-11 19:03:06

Question


I am trying to use ANTLR4 to parse source files. One requirement is that a string literal may contain any kind of character, including whitespace, while normal identifiers contain only English letters and digits (whitespace around them is thrown away).

I use the following ANTLR grammar (a minimal example), but it doesn't work as expected.

grammar parseString;

rules
    :   stringRule+
    ;

stringRule
    :   formatString
    |   idString
    ;

formatString
    :   STRING_DOUBLEQUOTE    STRING  STRING_DOUBLEQUOTE
    ;

idString
    :   (NONTERM | TERM)
    ;

// LEXER

STRING_DOUBLEQUOTE
    :   '"' ;

DIGITS
    :   DIGIT+
    ;

TERM
    :   UPPERCHAR CHAR+
    ;

NONTERM
    :   LOWERCHAR CHAR+
    ;

fragment
CHAR
    :   LOWERCHAR
    |   UPPERCHAR
    |   DIGIT
    |   '-'
    |   '_'
    ;

fragment
DIGIT
    :   [0-9]
    ;

fragment
LOWERCHAR
    :   [a-z]
    ;

fragment
UPPERCHAR
    :   [A-Z]
    ;

WS 
    :   (' ' | '\t' | '\r' | '\n')+ -> skip 
    ; // skip spaces, tabs, newlines

LINE_COMMENT
    :   '//' ~[\r\n]* -> skip
    ;

STRING
    :   ~('"')*
    ;

This is the test input that I use:

Test
HelloWorld
"$this is a string"
"*this is another string!"

I get the error line 1:0 extraneous input 'Test\nHelloWorld\n' expecting {'"', TERM, NONTERM}. The last two lines are parsed correctly as 'formatString'. The first two lines, however, are not matched as 'idString', apparently because the newline characters ('\n') are not thrown away. What am I doing wrong?


Answer 1:


Your STRING rule matches anything but a double quote, so it will scarf up just about anything. That is way too loose. I think you need a much tighter definition of exactly what distinguishes a STRING from the other tokens. Once the lexer is in ~'"'*, it will keep scarfing until it reaches a '"'.
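
As a minimal sketch of such a tighter definition (one possible fix, assuming escaped quotes inside strings are not needed), the quotes can be folded into the STRING token itself, so the rule can only start matching at a '"'. The trailing EOF in the top rule is an addition that is not in the original grammar.

```antlr
grammar parseString;

// EOF (added here) forces the parser to consume the whole input.
rules
    :   stringRule+ EOF
    ;

stringRule
    :   formatString
    |   idString
    ;

// The whole literal, quotes included, is now a single token.
formatString
    :   STRING
    ;

idString
    :   NONTERM
    |   TERM
    ;

// LEXER

// Starts only at '"', so it can no longer swallow identifiers or newlines.
// Whitespace between the quotes is kept because it is part of the token text.
STRING
    :   '"' ~["]* '"'
    ;

DIGITS
    :   DIGIT+
    ;

TERM
    :   UPPERCHAR CHAR+
    ;

NONTERM
    :   LOWERCHAR CHAR+
    ;

fragment CHAR      : LOWERCHAR | UPPERCHAR | DIGIT | '-' | '_' ;
fragment DIGIT     : [0-9] ;
fragment LOWERCHAR : [a-z] ;
fragment UPPERCHAR : [A-Z] ;

WS
    :   [ \t\r\n]+ -> skip   // skip spaces, tabs, newlines
    ;

LINE_COMMENT
    :   '//' ~[\r\n]* -> skip
    ;
```

With this version the test input from the question should tokenize as two TERM tokens followed by two STRING tokens. The surrounding quotes are part of each STRING's text and would have to be stripped in the listener or visitor if only the contents are wanted.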




Answer 2:


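Yes, there is a problem in this grammar. The STRING token matches 'Test\nHelloWorld\n'. The lexer always prefers the rule that matches the longest stretch of input, and at the start of the file STRING can consume everything up to the first '"', while TERM matches only 'Test'. So all of that text ends up in a single STRING token, and no parser rule accepts a bare STRING token on its own.

Consider changing the definition of the STRING token.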
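
One way to change it while keeping the quotes as separate tokens, as in the original formatString rule, is a lexer mode that is active only between the quotes. This is a sketch under assumptions, not the only possible fix: modes are allowed only in a lexer grammar, so it assumes the grammar is split into two files, and the names ParseStringLexer, OPEN_QUOTE, CLOSE_QUOTE and IN_STRING are illustrative.

```antlr
lexer grammar ParseStringLexer;

// Default mode: identifiers, whitespace and comments.
// Seeing a '"' switches the lexer into string mode.
OPEN_QUOTE
    :   '"' -> pushMode(IN_STRING)
    ;

TERM
    :   UPPERCHAR CHAR+
    ;

NONTERM
    :   LOWERCHAR CHAR+
    ;

WS
    :   [ \t\r\n]+ -> skip
    ;

LINE_COMMENT
    :   '//' ~[\r\n]* -> skip
    ;

fragment CHAR      : LOWERCHAR | UPPERCHAR | DIGIT | '-' | '_' ;
fragment DIGIT     : [0-9] ;
fragment LOWERCHAR : [a-z] ;
fragment UPPERCHAR : [A-Z] ;

// String mode: only reachable after OPEN_QUOTE, so STRING can never
// compete with TERM, NONTERM or WS in ordinary source text.
mode IN_STRING;

STRING
    :   ~["]+              // everything up to the closing quote, whitespace kept
    ;

CLOSE_QUOTE
    :   '"' -> popMode
    ;
```

In the matching parser grammar (declared with options { tokenVocab=ParseStringLexer; }), formatString would then read OPEN_QUOTE STRING? CLOSE_QUOTE, where the optional STRING allows for an empty literal.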



Source: https://stackoverflow.com/questions/23731646/antlr-parse-strings-keep-whitespaces-and-parse-normal-identifiers
