Can the ANTLR4 Java parser handle very large files, or can it stream files?

Submitted by 南楼画角 on 2019-11-28 05:48:06

Question


Is the Java parser generated by ANTLR capable of streaming arbitrarily large files?

I tried constructing a Lexer with an UnbufferedCharStream and passed that to the parser. I got an UnsupportedOperationException because of a call to size on the UnbufferedCharStream; the exception message explained that you can't call size on an UnbufferedCharStream.

    Lexer lexer = new Lexer(new UnbufferedCharStream(new CharArrayReader("".toCharArray())));
    CommonTokenStream stream = new CommonTokenStream(lexer);
    Parser parser = new Parser(stream);

I basically have a file I exported from Hadoop using Pig. It has a large number of rows separated by '\n', and the columns in each row are separated by '\t'. This is easy to parse in Java: I use a buffered reader to read each line, then split it on '\t' to get each column. But I also want some sort of schema validation. The first column should be a properly formatted date, followed by some price columns, followed by some hex columns.
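The line-by-line approach described above can be sketched in plain Java roughly as follows. This is only an illustration, not the asker's actual code: the class name `TsvSchemaValidator`, the ISO `yyyy-MM-dd` date format, the column counts (one date, two prices, two hex columns), and the regular expressions are all assumptions. It reads one row at a time, so memory use does not grow with the size of the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical helper: validates tab-separated rows one line at a time.
public class TsvSchemaValidator {
    // Assumed layout: 1 date column, then 2 price columns, then 2 hex columns.
    static final int PRICE_COLUMNS = 2;
    static final int HEX_COLUMNS = 2;
    static final Pattern PRICE = Pattern.compile("\\d+(\\.\\d+)?");
    static final Pattern HEX = Pattern.compile("[0-9a-fA-F]+");

    // Returns true if a single row matches the assumed schema.
    public static boolean isValidRow(String line) {
        String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length != 1 + PRICE_COLUMNS + HEX_COLUMNS) return false;
        try {
            LocalDate.parse(cols[0]); // assumed ISO format, e.g. 2013-07-06
        } catch (DateTimeParseException e) {
            return false;
        }
        for (int i = 1; i <= PRICE_COLUMNS; i++) {
            if (!PRICE.matcher(cols[i]).matches()) return false;
        }
        for (int i = 1 + PRICE_COLUMNS; i < cols.length; i++) {
            if (!HEX.matcher(cols[i]).matches()) return false;
        }
        return true;
    }

    // Streams the input row by row and returns the number of invalid rows.
    public static long countInvalidRows(Reader source) throws IOException {
        long invalid = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!isValidRow(line)) invalid++;
            }
        }
        return invalid;
    }

    public static void main(String[] args) throws IOException {
        // Two made-up sample rows: the first is valid, the second has a bad date.
        String sample = "2013-07-06\t10.50\t11\tff\t1A2b\n"
                      + "not-a-date\t10.50\t11\tff\t1A2b\n";
        System.out.println(countInvalidRows(new StringReader(sample)));
    }
}
```

For a real export, the `StringReader` would be replaced by a reader over the file, for example one obtained from `Files.newBufferedReader(path)`.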

When I look at the generated parser code, I see that I could call it like so:

    parser.lines().line()

This would give me a List that I could, conceptually, iterate over. But the list seems to have a fixed size by the time I get it, which means the parser has probably already parsed the entire file.

Is there another part of the API that lets you stream really large files, such as a way for a Visitor or Listener to be called while the file is being read? It cannot keep the entire file in memory; the file will not fit.


Answer 1:


You could do it like this:

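    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.antlr.v4.runtime.*;

    InputStream is = new FileInputStream(inputFile); // inputFile is the path to your input file
    // UnbufferedCharStream reads on demand; ANTLRInputStream would load the whole file into memory
    CharStream input = new UnbufferedCharStream(is);
    GeneratedLexer lex = new GeneratedLexer(input);
    lex.setTokenFactory(new CommonTokenFactory(true)); // copy token text out of the unbuffered stream
    TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
    GeneratedParser parser = new GeneratedParser(tokens);
    parser.setBuildParseTree(false); // do not build a parse tree
    parser.top_level_rule();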

And if the file is quite big, forget about a listener or visitor; I would create the objects directly in the grammar. Just put them in some structure (e.g. a HashMap or Vector) and retrieve them as needed. This avoids building the parse tree, which is what really takes a lot of memory.



Source: https://stackoverflow.com/questions/17500291/can-antlr4-java-parser-handle-very-large-files-or-can-it-stream-files
