ANTLR4: Using non-ASCII characters in token rules


ANTLR is ready to accept 16-bit characters, but by default many locales will read in characters as bytes (8 bits). You need to specify the appropriate encoding when you read the file using the Java libraries. If you are using the TestRig, perhaps through the grun alias/script, then use the argument -encoding utf-8 (or whatever encoding your input is in). If you look at the source code of that class, you will see the following mechanism:

// Open the input file and decode it with the correct character encoding
InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
ANTLRInputStream input = new ANTLRInputStream(r);
// Feed the decoded characters to the generated lexer and buffer its tokens
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
...
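
Note: newer ANTLR releases (4.7 and later) deprecate ANTLRInputStream in favor of the CharStreams factory methods. Here is a minimal sketch of the same file-reading idea with that API, assuming ANTLR 4.7+, a generated XLexer as above, and a hypothetical input.txt:

import java.nio.charset.StandardCharsets;
import org.antlr.v4.runtime.*;

// Minimal sketch (ANTLR 4.7+): "input.txt" is a placeholder path.
// CharStreams decodes the file with the given charset before handing it to the lexer.
CharStream input = CharStreams.fromFileName("input.txt", StandardCharsets.UTF_8);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);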

For those having the same problem using ANTLR 4 in Java code, ANTLRInputStream being deprecated, here is a working way to pass multi-character Unicode data from a String to the MyLexer lexer:

    import java.nio.CharBuffer;
    import org.antlr.v4.runtime.*;

    String myString = "\u2013"; // an en dash (U+2013), outside ASCII

    // Wrap the string's chars, then convert them to a code-point buffer
    CharBuffer charBuffer = CharBuffer.wrap(myString.toCharArray());
    CodePointBuffer codePointBuffer = CodePointBuffer.withChars(charBuffer);
    CodePointCharStream cpcs = CodePointCharStream.fromBuffer(codePointBuffer);

    MyLexer lexer = new MyLexer(cpcs);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
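
For reference, the CharStreams factory (ANTLR 4.7+) builds the same kind of CodePointCharStream in a single call; a minimal sketch assuming the same hypothetical MyLexer as above:

    import org.antlr.v4.runtime.*;

    // CharStreams.fromString wraps the in-memory string in a CodePointCharStream directly
    CharStream cs = CharStreams.fromString("\u2013");
    MyLexer lexer = new MyLexer(cs);
    CommonTokenStream tokens = new CommonTokenStream(lexer);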