Catching (and keeping) all comments with ANTLR


I'm writing a grammar in ANTLR that parses Java source files into ASTs for later analysis. Unlike other parsers (like JavaDoc), I'm trying to keep all of the comments. This

6 Answers

南旧

2020-12-10 19:58
Is there a way to make ANTLR automatically add any comments it finds to the AST?

No, you'll have to sprinkle your entire grammar with extra comments rules to account for all the valid places comments can occur:
```
...

if_stat
 : 'if' comments '(' comments expr comments ')' comments ...
 ;

...

comments
 : (SingleLineComment | MultiLineComment)*
 ;

SingleLineComment
 : '//' ~('\r' | '\n')*
 ;

MultiLineComment
 : '/*' .* '*/'
 ;
```
没有蜡笔的小新

2020-12-10 20:01
Section 12.1 of "The Definitive ANTLR 4 Reference" shows how to get access to comments without having to sprinkle comments rules throughout the grammar. In short, add this to the grammar file:
```
grammar Java;

@lexer::members {
    public static final int WHITESPACE = 1;
    public static final int COMMENTS = 2;
}
```
Then send your comment rules to that channel:
```
COMMENT
    : '/*' .*? '*/' -> channel(COMMENTS)
    ;

LINE_COMMENT
    : '//' ~[\r\n]* -> channel(COMMENTS)
    ;
```
Then, in your code, ask the token stream for the comment tokens with getHiddenTokensToLeft/getHiddenTokensToRight. Section 12.1 of the book walks through how to do this.
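To make the lookup concrete, here is a minimal plain-Java model of what a "hidden tokens to the left" query does. It does not use the ANTLR runtime: Tok, hiddenToLeft and the class name are hypothetical stand-ins for illustration, and only the channel number 2 comes from the grammar snippet above. The idea it sketches is to walk left from a token and collect the off-channel tokens on the requested channel until a default-channel token is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HiddenChannelSketch {
    static final int DEFAULT = 0;   // channel the parser consumes
    static final int COMMENTS = 2;  // matches COMMENTS in the grammar snippet

    // Hypothetical stand-in for a lexer token: only text and channel.
    static final class Tok {
        final String text;
        final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    // Walk left from index i, stop at the first default-channel token,
    // and keep the texts of tokens on the requested channel in source order.
    static List<String> hiddenToLeft(List<Tok> tokens, int i, int channel) {
        List<String> found = new ArrayList<>();
        for (int j = i - 1; j >= 0 && tokens.get(j).channel != DEFAULT; j--) {
            if (tokens.get(j).channel == channel) {
                found.add(0, tokens.get(j).text);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
            new Tok("/* doc */", COMMENTS),
            new Tok("int", DEFAULT),
            new Tok("// count", COMMENTS),
            new Tok("x", DEFAULT));
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).channel == DEFAULT) {
                System.out.println(tokens.get(i).text + " <- " + hiddenToLeft(tokens, i, COMMENTS));
            }
        }
    }
}
```

One difference from this sketch: the real stream method returns null rather than an empty list when there is nothing hidden to the left, so calling code should check for null.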

小蘑菇

2020-12-10 20:09

First, direct all comments to a dedicated channel (one used only for comments):

    COMMENT
        : '/*' .*? '*/' -> channel(2)
        ;

    LINE_COMMENT
        : '//' ~[\r\n]* -> channel(2)
        ;

Second, print out all comments:

      // Needs java.util.List plus CommonTokenStream and Token from org.antlr.v4.runtime.
      // IDLParser is this answer's generated parser; substitute whatever parser you have.
      CommonTokenStream tokens = new CommonTokenStream(lexer);
      tokens.fill(); // lex to EOF so every token, hidden or not, is buffered
      for (int index = 0; index < tokens.size(); index++)
      {
         Token token = tokens.get(index);
         if (token.getType() != IDLParser.WS)
         {
            String out = "";
            // Comments will be printed as channel 2 (configured in .g4 grammar file)
            out += "Channel: " + token.getChannel();
            out += " Type: " + token.getType();
            out += " Hidden: ";
            // Off-channel tokens between the previous default-channel token and this one (null if none)
            List<Token> hiddenTokensToLeft = tokens.getHiddenTokensToLeft(index);
            for (int i = 0; hiddenTokensToLeft != null && i < hiddenTokensToLeft.size(); i++)
            {
               Token hidden = hiddenTokensToLeft.get(i);
               if (hidden.getType() != IDLParser.WS)
               {
                  out += "\n\t" + i + ":";
                  out += "\n\tChannel: " + hidden.getChannel() + "  Type: " + hidden.getType();
                  out += " Text: " + hidden.getText().replaceAll("\\s", "");
               }
            }
            out += " Text: " + token.getText().replaceAll("\\s", "");
            System.out.println(out);
         }
      }


太阳男子

2020-12-10 20:17

The "island grammars" technique can also be used. See the following section in the ANTLR 4 book:

Island Grammars: Dealing with Different Formats in the Same File

花落未央

2020-12-10 20:19
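I handled this in the lexer:
```
WS  :   ( [ \t\r\n] | COMMENT) -> skip
;

fragment
COMMENT
: '/*' .*? '*/' /* multi-line comment; non-greedy so it stops at the first closing delimiter */
| '//' ~('\r' | '\n')* /* single-line comment */
;
```
That way the comments are removed automatically. Note that skip discards them entirely, so send them to a channel instead if you need to keep them.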
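The block-comment alternative is sensitive to greediness: in ANTLR 4 a plain .* is greedy, so it runs on to the last closing delimiter in the input, while .*? stops at the first one. The sketch below illustrates the same effect using java.util.regex as an analogy. It is not ANTLR code, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyCommentSketch {
    // Collect every match of the given block-comment pattern in src.
    // DOTALL lets '.' cross line breaks, as a multi-line comment must.
    static List<String> matches(String regex, String src) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(src);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String src = "/* a */ int x; /* b */";
        // Greedy: one match that swallows the code between the two comments.
        System.out.println(matches("/\\*.*\\*/", src));
        // Reluctant: each comment is matched on its own.
        System.out.println(matches("/\\*.*?\\*/", src));
    }
}
```

With the greedy pattern, the declaration between the two comments ends up inside a single "comment", which is the failure mode to watch for in a lexer rule.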
Happy的楠姐

2020-12-10 20:19

For ANTLR v3:

Whitespace tokens are usually not processed by the parser, but they are still captured on the HIDDEN channel.

If you use BufferedTokenStream, you can get the list of all tokens from it and post-process them, adding the hidden ones back as needed.
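As a rough illustration of that post-processing pass, the following plain-Java sketch buffers off-channel tokens and attaches them to the next default-channel token. It does not call the ANTLR runtime: Tok, OFF, attachLeading and the class name are hypothetical, and a real implementation would read its tokens from the buffered stream and attach them to AST nodes rather than to token indices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AttachCommentsSketch {
    static final int DEFAULT = 0; // channel the parser reads
    static final int OFF = 1;     // hypothetical off-channel number for this sketch

    // Hypothetical stand-in for a token; not the ANTLR Token interface.
    static final class Tok {
        final String text;
        final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    // Single pass over the full token list: buffer off-channel tokens and
    // file them under the index of the next default-channel token.
    static Map<Integer, List<String>> attachLeading(List<Tok> tokens) {
        Map<Integer, List<String>> leading = new LinkedHashMap<>();
        List<String> pending = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.channel != DEFAULT) {
                pending.add(t.text);
            } else if (!pending.isEmpty()) {
                leading.put(i, pending);
                pending = new ArrayList<>();
            }
        }
        return leading;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
            new Tok("// header", OFF),
            new Tok("class", DEFAULT),
            new Tok("A", DEFAULT),
            new Tok("/* body */", OFF),
            new Tok("{", DEFAULT),
            new Tok("}", DEFAULT),
            // An end marker on the default channel picks up trailing comments.
            new Tok("<EOF>", DEFAULT));
        System.out.println(attachLeading(tokens));
    }
}
```

Keying by token index keeps the sketch short. When building a tree, the same pending list could instead be stored on whichever node owns that token.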
