ANTLR4 doesn't correctly recognize Unicode characters

落爺英雄遲暮 submitted on 2019-12-11 03:54:01

Question


I have a very simple grammar that tries to match 'é' to the token E_CODE. I tested it with the TestRig tool (using the -tokens option), but the parser can't match it correctly. My input file is encoded in UTF-8 without a BOM, and I'm using ANTLR version 4.4. Could somebody else check this as well? I got this output on my console:
line 1:0 token recognition error at: 'Ă'

grammar Unicode;

stat:EOF;  
E_CODE: '\u00E9' | 'é';

Answer 1:


I tested the grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '\u00E9' | 'é';

as follows:

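// Needs: import org.antlr.v4.runtime.*;
// Feed the lexer 'é' twice: once as a Java escape, once as a literal.
UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());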

and the following got printed to my console:

éé<EOF>

Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).

EDIT

Looking at the source, I see that TestRig takes an optional -encoding parameter. Have you tried setting it?
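
The 'Ă' in the error message is consistent with the UTF-8 input being decoded with a different charset. As a minimal sketch using only the JDK, and assuming (this is a guess, the question doesn't state it) that the platform default charset was windows-1250, the following shows how the two UTF-8 bytes of 'é' turn into 'Ă' followed by '©'. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Decode raw bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Render a string as a space-separated list of code points, e.g. "U+00E9".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'é' (U+00E9) is stored as two bytes in UTF-8: 0xC3 0xA9.
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8: one character, the one E_CODE expects.
        System.out.println(describe(decode(utf8, "UTF-8")));

        // Decoded as windows-1250 (ASSUMED default): two characters,
        // the first of which is 'Ă' (U+0102).
        System.out.println(describe(decode(utf8, "windows-1250")));
    }
}
```

If that is what happens here, passing -encoding UTF-8 to TestRig should make it read the input as a single 'é'.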




Answer 2:


Your grammar file is not saved in UTF-8 format. According to Terence Parr's book, UTF-8 is the default format that ANTLR accepts for input grammar files.
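
One way to check this is sketched below using only the JDK: a strict UTF-8 decoder rejects a file in which 'é' was saved as the single byte 0xE9 (as ISO-8859-1 or windows-1250 would store it). The class name, method name, and the command-line handling are hypothetical, for illustration only. Note that a pure-ASCII file always passes, so the check is only informative when the file contains non-ASCII characters.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Check the file given on the command line, e.g. the grammar file.
            byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
            System.out.println(args[0] + ": valid UTF-8 = " + isValidUtf8(bytes));
        } else {
            // No file given: compare the two ways 'é' is commonly stored.
            byte[] asUtf8 = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 encoding
            byte[] asSingleByte = {(byte) 0xE9};         // ISO-8859-1 / windows-1250
            System.out.println("0xC3 0xA9: valid UTF-8 = " + isValidUtf8(asUtf8));
            System.out.println("0xE9: valid UTF-8 = " + isValidUtf8(asSingleByte));
        }
    }
}
```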



Source: https://stackoverflow.com/questions/26549393/antlr4-doesnt-correctly-recognizes-unicode-characters
