Serialization of ANTLR ParseTree

Question

I have a generated grammar that does two things:

Check the syntax of a domain specific language
Evaluate input against that domain specific language

These two functions are separate; let's call them validate() and evaluate().

The validate() function builds the tree from a String input while ensuring it meets the requirements of the BNF for the language. The evaluate() function plugs in values to that tree to get a result (usually true or false).

Currently the code runs validate() on the input every time, just to generate the tree that evaluate() uses. Some of the inputs take up to 60 seconds to check. What I would like to do is serialize the result of validate() (assuming the input meets the syntax requirements), store the serialized form in the backend database, and load it from the database as part of evaluate().

I noticed that I can execute the method toStringTree() on the parse tree, and retrieve a LISP style tree. However, can I restore a LISP style tree to an ANTLR parse tree? If not, can anyone recommend another way to serialize and store the generated parse tree?

Thanks for any help.

Jason

Answer 1:

ANTLR 4's ParserRuleContext data structure (the specific implementation of ParseTree used by generated parsers to represent grammar rules in the parse tree) is not serializable by default. Open issue #233 on the project issue tracker covers the feature request. However, based on my experience with many applications using ANTLR for parsing, I'm not convinced serializing the parse trees would be useful in the long run. For each problem serializing the parse tree is meant to address, a better solution already exists.

Another option is to store a hash of the last known valid file in the database. After you use the parser to create a parse tree, you could skip the validation step if the input file has the same hash as the last time it was validated. This leverages two aspects of ANTLR 4:

For the same input file, running the parser twice will produce the same parse tree.
The ANTLR 4 parser is extremely fast in almost all cases (e.g. the Java grammar can process around 20MB of source per second). The remaining cases tend to be caused by poorly structured grammar rules that the new parser interpreter feature in ANTLRWorks 2.2 can analyze and make suggestions for improvement.
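The hash check described above could look something like the following minimal sketch. The class name ValidationCache, the in-memory set standing in for the database table, and the validator callback standing in for validate() are all hypothetical names chosen for illustration; none of them are part of ANTLR.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper: remembers hashes of inputs that already passed validation.
class ValidationCache {
    // Stand-in for a database table of known-valid input hashes (assumption).
    private final Set<String> validHashes = new HashSet<>();

    // Returns the hex-encoded SHA-256 digest of the input text.
    static String sha256(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Runs the (expensive) validator only when this exact text has not passed before.
    boolean isValid(String input, Predicate<String> validator) {
        String hash = sha256(input);
        if (validHashes.contains(hash)) {
            return true; // same text was validated earlier; skip the syntax check
        }
        boolean ok = validator.test(input);
        if (ok) {
            validHashes.add(hash);
        }
        return ok;
    }
}
```

Note that this only avoids repeating the validation step. The parse tree itself is still rebuilt by running the parser, which relies on the first point above: the same input produces the same tree.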

If you need performance beyond what you get with this, then a parse tree isn't the data structure you should be using. StringTemplate 4's enormous performance advantage over StringTemplate 3 came primarily from the fact that the interpreter switched from using ASTs (equivalent to parse trees for this reasoning) to a linear bytecode representation/interpreter. The ASTs for ST4 would never need to be serialized for performance reasons because the bytecode would be serialized instead. In fact, the C# port of StringTemplate 4 provides exactly this feature.
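To illustrate what "a linear bytecode representation" can mean for a true/false DSL like the one in the question, here is a rough sketch of a postfix instruction stream evaluated by a small stack machine. The opcodes and the BoolProgram class are invented for this example; this is not StringTemplate's bytecode format and not an ANTLR feature. The point is only that a flat int[] is trivial to store in a database and cheap to evaluate.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical mini "bytecode" for boolean expressions (illustration only).
class BoolProgram {
    static final int LOAD = 0; // followed by one operand: index into the values array
    static final int AND = 1;
    static final int OR = 2;
    static final int NOT = 3;

    // Evaluates a postfix instruction stream against the supplied variable values.
    static boolean run(int[] code, boolean[] values) {
        Deque<Boolean> stack = new ArrayDeque<>();
        int pc = 0;
        while (pc < code.length) {
            int op = code[pc++];
            switch (op) {
                case LOAD:
                    stack.push(values[code[pc++]]);
                    break;
                case AND: {
                    boolean right = stack.pop();
                    boolean left = stack.pop();
                    stack.push(left && right);
                    break;
                }
                case OR: {
                    boolean right = stack.pop();
                    boolean left = stack.pop();
                    stack.push(left || right);
                    break;
                }
                case NOT:
                    stack.push(!stack.pop());
                    break;
                default:
                    throw new IllegalArgumentException("unknown opcode " + op);
            }
        }
        return stack.pop();
    }
}
```

For example, (a AND NOT b) OR c would be stored as {LOAD, 0, LOAD, 1, NOT, AND, LOAD, 2, OR}, and evaluate() would call BoolProgram.run(code, values) with the current values of a, b and c. A listener or visitor over the parse tree would emit the array once, after validation succeeds.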

Answer 2:

If the input data to your grammar is made of several independent blocks, you could try to store the string of each block separately, and run the parsing process again for each block independently, using a ThreadPool for example.

Say for example your input data is a set of method declarations:

int add(int a, int b) {
    return a+b;
}

int mul(int a, int b) {
    return a*b;
}

...

and the grammar is something like:

methodList : methodDeclaration methodList
           |
           ;
methodDeclaration : // your method declaration rules...

The first run of the parser just collects the text of each method and stores it. The parser starts the process at the methodList rule.

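// Generated visitor methods return the visitor's result type, so this
// assumes the visitor extends the generated base visitor with Void.
@Override
public Void visitMethodList(MethodListContext ctx) {
    if(ctx.methodDeclaration() != null) {
        String methodStr = formatParseTree(ctx.methodDeclaration(), " ");
        // store methodStr for later parsing
    }

    // visit next method list item, if any
    if(ctx.methodList() != null) {
        visit(ctx.methodList());
    }
    return null;
}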

The second run launches the parsing of each method declaration (in a separate thread, for example). For this, the parser starts at the methodDeclaration rule.

@Override
public Void visitMethodDeclaration(MethodDeclarationContext ctx) {
    // parse the method block
    return null;
}
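Running the second pass on a thread pool could be sketched roughly as below. BlockParsing and parseBlock are hypothetical names: parseBlock stands in for whatever creates a lexer and parser for one stored block and invokes the methodDeclaration rule. Each task would presumably build its own lexer and parser, since a single parser instance is not meant to be shared between threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper that parses independent blocks in parallel.
class BlockParsing {
    // Applies parseBlock to every stored block on a fixed-size pool.
    // Results come back in the same order as the input blocks.
    static <R> List<R> parseAll(List<String> blocks, Function<String, R> parseBlock, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String block : blocks) {
                Callable<R> task = () -> parseBlock.apply(block);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while parsing blocks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a block failed to parse", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as BlockParsing.parseAll(storedMethods, text -> parseMethod(text), 4) would then return one result per stored method, where parseMethod is again a placeholder for your own parsing code.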

The text of a methodDeclaration rule is formatted because calling ctx.methodDeclaration().getText() directly would combine the text of all child nodes with nothing in between (see the ANTLR documentation; skipped or hidden-channel tokens such as white space are not part of the tree), possibly making the result unusable for parsing again. If white space is a token separator in the grammar, then adding one space between tokens should not change the parse tree.

String formatParseTree(ParseTree tree, String separator) {
    StringBuilder builder = new StringBuilder();
    for(int i = 0; i < tree.getChildCount(); i ++) {
        ParseTree child = tree.getChild(i);

        if(child instanceof TerminalNode) {
            builder.append(child.getText());
            builder.append(separator);
        } else if(child instanceof RuleContext) {
            builder.append(formatParseTree(child, separator));
        }
    }

    return builder.toString();
}
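To see the difference without any ANTLR dependency, the sketch below mimics both behaviours on a plain list of token texts. TokenJoinDemo and its two methods are made up for this illustration; they only imitate what getText() and formatParseTree() do with the leaf tokens.

```java
import java.util.List;

// Hypothetical demo: plain strings stand in for the terminal nodes of a parse tree.
class TokenJoinDemo {
    // Mimics getText(): token texts concatenated with nothing in between.
    static String concatenated(List<String> tokens) {
        return String.join("", tokens);
    }

    // Mimics formatParseTree(tree, separator): each token followed by the separator.
    static String separated(List<String> tokens, String separator) {
        StringBuilder builder = new StringBuilder();
        for (String token : tokens) {
            builder.append(token).append(separator);
        }
        return builder.toString();
    }
}
```

For the tokens "int", "add", "(", "int", "a", ")", concatenated(...) gives "intadd(inta)", where the lexer would now see the single identifier intadd, while separated(..., " ") gives "int add ( int a ) ", which lexes back into the original tokens.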

Source: https://stackoverflow.com/questions/22562061/serialization-of-antlr-parsetree

Tags

antlr

antlr4