Question
I am stuck on a pretty simple grammar. Googling and reading books did not help. I started using ANTLR quite recently, so this is probably a very simple question.
I am trying to write a very simple lexer using ANTLR v3.
grammar TestLexer;
options {
    language = Java;
}
TEST_COMMENT
    :   '/*' WS? TEST WS? '*/'
    ;
ML_COMMENT
    :   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;
TEST
    :   'TEST'
    ;
WS
    :   (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;}
    ;
The test class:
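import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.Lexer;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.Token;
import org.antlr.runtime.TokenStream;

public class TestParserInvoker {
    private static void extractCommandsTokens(final String script) throws RecognitionException {
        final ANTLRStringStream input = new ANTLRStringStream(script);
        final Lexer lexer = new TestLexer(input);
        // Not used below: the tokens are pulled straight from the lexer.
        final TokenStream tokenStream = new CommonTokenStream(lexer);
        Token t;
        // Print every token, including the final EOF token.
        do {
            t = lexer.nextToken();
            if (t != null) {
                System.out.println(t);
            }
        } while (t == null || t.getType() != Token.EOF);
    }

    public static void main(final String[] args) throws RecognitionException {
        final String script = "/* TEST */";
        extractCommandsTokens(script);
    }
}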
So when the test string is "/* TEST */", the lexer produces two tokens, as expected: one of type TEST_COMMENT and one EOF. Everything is OK.
But if the test string has one extra space at the end, "/* TEST */ ", the lexer produces three tokens: ML_COMMENT, WS and EOF.
Why does the first token get the ML_COMMENT type? I thought the way a token is detected depends only on the precedence of the lexer rules in the grammar. And of course it should not depend on the tokens that follow.
Thanks for the help!
P.S. I can use the lexer option filter=true, and the token then gets the correct type, but this approach requires extra work in the token definitions. To be honest, I do not want to use this type of lexer.
Answer 1:
ANTLR tokenizes the character stream starting from the top rule downwards and tries to match as much as possible. So, yes, I would also have expected a TEST_COMMENT to be created for both "/* TEST */" and "/* TEST */ ". You can always have a look at the generated source code of the lexer to see why it chooses to create an ML_COMMENT for the second input.
Whether this is a bug or expected behavior, I would not use separate lexer rules that look so much alike. Could you explain what you're really trying to solve here?
user776872 wrote:
I can use the lexer option filter=true, and the token then gets the correct type, but this approach requires extra work in the token definitions. To be honest, I do not want to use this type of lexer.
I don't quite understand this remark. Are you only interested in a part of the input source? In that case, filter=true is surely a good option. If you want to tokenize all input source, then you shouldn't use filter=true.
EDIT
To distinguish between multi-line comments and Javadoc comments, it's best to keep both in the same rule and change the type of the token if it starts with /**, like this:
grammar T;
// options
tokens {
    DOC_COMMENT;
}
// rules
COMMENT
    :   '/*' (~'*' .*)? '*/'
    |   '/**' ~'/' .* '*/' {$type=DOC_COMMENT;}
    ;
Note that both .* and .+ are non-greedy by default in ANTLR v3 (contrary to popular belief).
Demo
grammar T;
tokens {
    DOC_COMMENT;
}
@parser::members {
    public static void main(String[] args) throws Exception {
        TLexer lexer = new TLexer(new ANTLRStringStream("/**/ /*foo*/ /**bar*/"));
        TParser parser = new TParser(new CommonTokenStream(lexer));
        parser.parse();
    }
}
parse
    :   (t=. {System.out.println(tokenNames[$t.type] + " :: " + $t.text);})* EOF
    ;
COMMENT
    :   '/*' (~'*' .*)? '*/'
    |   '/**' ~'/' .* '*/' {$type=DOC_COMMENT;}
    ;
SPACE
    :   ' ' {$channel=HIDDEN;}
    ;
which produces:
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar TParser
COMMENT :: /**/
COMMENT :: /*foo*/
DOC_COMMENT :: /**bar*/
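For readers who want to see the type decision without running ANTLR, here is a minimal plain-Java sketch of what the two alternatives of the COMMENT rule accept. It is only an illustration, not the code ANTLR generates: the class CommentKind and the method classify are hypothetical names, and the sketch assumes the text it receives is already exactly one comment (it does not scan for the first closing */ the way the non-greedy .* in the lexer does).

```java
// Plain-Java sketch (no ANTLR needed) of the type decision made by the COMMENT rule.
// CommentKind and classify are hypothetical names, not part of ANTLR or the grammar.
class CommentKind {

    // Returns "COMMENT" or "DOC_COMMENT" for a text assumed to be exactly one
    // comment, or null if the text fits neither alternative of the rule.
    static String classify(String text) {
        if (text.length() < 4 || !text.startsWith("/*") || !text.endsWith("*/")) {
            return null;
        }
        // First alternative with an empty body: the four characters "/**/".
        if (text.length() == 4) {
            return "COMMENT";
        }
        // First alternative with a body: the body starts with a character other than '*'.
        if (text.charAt(2) != '*') {
            return "COMMENT";
        }
        // Second alternative: "/**", then a character other than '/', then the closing "*/".
        if (text.length() >= 6 && text.charAt(3) != '/') {
            return "DOC_COMMENT";
        }
        return null;
    }

    public static void main(String[] args) {
        // Same three comments as in the demo input above.
        for (String s : new String[] {"/**/", "/*foo*/", "/**bar*/"}) {
            System.out.println(classify(s) + " :: " + s);
        }
    }
}
```

Run on the three comments from the demo input, the sketch prints the same three lines as the ANTLR demo. The point of the design is that a single rule matches every comment and the type is decided from the first characters after /*, so there are no two near-identical lexer rules competing for the same input.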
Source: https://stackoverflow.com/questions/6181585/token-type-depends-on-following-token