ANTLR4: How can I both hide and use tokens in a grammar?

I'm parsing a script language that defines two types of statements: control statements and non-control statements. Non-control statements always end with ';', while control statements may end with ';' or EOL ('\n'). Part of the grammar looks like this:

script
    :   statement* EOF
    ;

statement
    :   control_statement
    |   no_control_statement
    ;

control_statement
    :   if_then_control_statement
    ;

if_then_control_statement
    :   IF expression THEN end_control_statment
        ( statement ) *
        ( ELSEIF expression THEN end_control_statment ( statement )* )*
        ( ELSE end_control_statment ( statement )* )?
        END IF end_control_statment
    ;

no_control_statement
    :   sleep_statement
    ;

sleep_statement
    :   SLEEP expression END_STATEMENT
    ;

end_control_statment
    :   END_STATEMENT
    |   EOL
    ;

END_STATEMENT
    :   ';'
    ;

ANY_SPACE
    :   ( LINE_SPACE | EOL )    ->  channel(HIDDEN)
    ;

EOL
    :   [\n\r]+
    ;

LINE_SPACE
    :   [ \t]+
    ;

In all other aspects of the script language, I never care about EOL, so I use the normal lexer rules to hide whitespace.

This works fine except where I need an EOL to terminate a control statement. With the grammar above, every EOL is hidden, so the control statement rules never see it.

Is there a way to change my grammar so that I can skip all EOL but the ones needed to terminate parts of my control statements?

I found one way to handle this.

The idea is to divert EOL into one hidden channel and everything else I don't want to see (such as spaces and comments) into another hidden channel. Then, where an EOL is supposed to show up, I use some code to walk back through the tokens that have already been consumed and examine their channels. If I find something on the EOL channel before I run into something from the ordinary channel, then it is OK.

It looks like this:

I changed the lexer rules:

@lexer::members {
    public static int EOL_CHANNEL = 1;
    public static int OTHER_CHANNEL = 2;
}

...

EOL
  : '\r'? '\n'  ->  channel(EOL_CHANNEL)
  ;

LINE_SPACE
  : [ \t]+  ->  channel(OTHER_CHANNEL)
  ;

I also diverted everything else that used to go to the HIDDEN channel (comments) to OTHER_CHANNEL. Then I changed the rule end_control_statment:

end_control_statment
  : END_STATEMENT
  | { isEOLPrevious() }?
  ;

and added these parser members:

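@parser::members {
  // Must match the channel numbers declared in @lexer::members
  public static int EOL_CHANNEL = 1;
  public static int OTHER_CHANNEL = 2;

  // True if an EOL precedes the current token, ignoring spaces and comments
  boolean isEOLPrevious()
  {
        int idx = getCurrentToken().getTokenIndex();

        // Walk back over already consumed tokens; stop at the start of the stream
        while (--idx >= 0)
        {
            int ch = getTokenStream().get(idx).getChannel();
            if (ch != OTHER_CHANNEL)
            {
                // Channel 1 is only carrying EOL, no need to check token itself
                return (ch == EOL_CHANNEL);
            }
        }
        return false;
     }
}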
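
The backward scan can be tried out without generating a parser. The following is a minimal Java sketch, assuming each token is reduced to just its channel number. The class name EolLookback, the list-based token model and DEFAULT_CHANNEL are illustrative assumptions, not part of ANTLR or of the grammar above. The sketch returns false when the scan reaches the start of the token list.

```java
import java.util.List;

public class EolLookback {
    // Channel numbers mirror the ones used in the grammar; DEFAULT_CHANNEL is assumed to be 0
    static final int DEFAULT_CHANNEL = 0;
    static final int EOL_CHANNEL = 1;
    static final int OTHER_CHANNEL = 2;

    /**
     * channels.get(i) is the channel of token i; idx is the index of the current token.
     * Returns true if the nearest preceding token that is not on OTHER_CHANNEL is an EOL.
     */
    static boolean isEolPrevious(List<Integer> channels, int idx) {
        for (int i = idx - 1; i >= 0; i--) {
            int ch = channels.get(i);
            if (ch != OTHER_CHANNEL) {
                return ch == EOL_CHANNEL;
            }
        }
        return false; // reached the start of the stream without seeing an EOL
    }

    public static void main(String[] args) {
        // Models: IF <space> x <space> THEN <space> <EOL> <space> SLEEP
        List<Integer> channels = List.of(0, 2, 0, 2, 0, 2, 1, 2, 0);
        System.out.println(isEolPrevious(channels, 8)); // current token SLEEP: true
        System.out.println(isEolPrevious(channels, 4)); // current token THEN: false
    }
}
```

In the real parser the same loop runs over getTokenStream() instead of a list of integers.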

One could stick to the ordinary hidden channel, but then the backtracking code has to check both the channel and the token type, so this is maybe a bit easier.

Hope this helps someone else dealing with this kind of issue.
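
For comparison, the single hidden channel variant could look roughly like the sketch below. It is a plain Java model rather than ANTLR code: the Tok class and the token type constants WORD, EOL and LINE_SPACE are hypothetical stand-ins for the generated token types, and the channel numbers are assumed to be 0 for the default channel and 1 for the hidden one.

```java
import java.util.List;

public class HiddenChannelLookback {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 1;

    // Hypothetical token types, standing in for the generated lexer constants
    static final int WORD = 1;
    static final int EOL = 2;
    static final int LINE_SPACE = 3;

    static final class Tok {
        final int type;
        final int channel;

        Tok(int type, int channel) {
            this.type = type;
            this.channel = channel;
        }
    }

    /** Scans backwards from idx; both the channel and the token type must be inspected. */
    static boolean isEolPrevious(List<Tok> tokens, int idx) {
        for (int i = idx - 1; i >= 0; i--) {
            Tok t = tokens.get(i);
            if (t.channel != HIDDEN_CHANNEL) {
                return false; // hit a visible token before any EOL
            }
            if (t.type == EOL) {
                return true;  // hidden token that is an EOL
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Tok> tokens = List.of(
            new Tok(WORD, DEFAULT_CHANNEL),
            new Tok(LINE_SPACE, HIDDEN_CHANNEL),
            new Tok(EOL, HIDDEN_CHANNEL),
            new Tok(LINE_SPACE, HIDDEN_CHANNEL),
            new Tok(WORD, DEFAULT_CHANNEL));
        System.out.println(isEolPrevious(tokens, 4)); // true
        System.out.println(isEolPrevious(tokens, 2)); // false
    }
}
```

The extra type check is the only difference from the two-channel version, which is why a dedicated EOL channel keeps the predicate slightly simpler.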

Source: https://stackoverflow.com/questions/41667217/antlr4-how-can-i-both-hide-and-use-tokens-in-a-grammar

Tags: parsing, whitespace, antlr4