Parse arbitrary delimiter character using Antlr4

问题

I try to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar as in Perl). How can I achieve this?

To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:

REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;

This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).

If I had to solve this using a regular expression itself, I would use a backreference like this:

m(.)(.+?)\1

(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.

It would be even better when I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all different variants.

Can this be solved with Antlr4?

回答1:

You could do something like this:

lexer grammar TLexer;

REGEX
 : REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
 | '{' REGEX_ATOM+ '}'
 | '(' REGEX_ATOM+ ')'
 ;

ANY
 : .
 ;

fragment REGEX_DELIMITER
 : [/~@#]
 ;

fragment REGEX_ATOM
 : '\\' .
 | ~[\\]
 ;

If you run the following class:

public class Main {

  public static void main(String[] args) throws Exception {

    TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));

    for (Token t : lexer.getAllTokens()) {
      System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
    }
  }
}

you will see the following output:

REGEX                /foo/
ANY                   
ANY                  /
ANY                  b
ANY                  a
ANY                  r
ANY                  \
ANY                   
REGEX                ~\~~
ANY                   
REGEX                {mu}
ANY                   
ANY                  (
ANY                  b
ANY                  l
ANY                  a
ANY                  (

The {...}? is called a predicate:

Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?

The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter matched by the first chararcter (which is a REGEX_DELIMITER, of course).

Tested with ANTLR 4.5.3

EDIT

And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):

lexer grammar TLexer;

  @lexer::members {
    boolean delimiterAhead(String start) {
      return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
    }
  }

  REGEX
   : '/' ( '\\' . | ~[/\\] )+ '/'
   | 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
   | 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
   | 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
   ;

  ANY
   : .
   ;

  fragment REGEX_DELIMITER
   : [~@#]
   ;

  fragment SPACES
   : [ \t]+
   ;

来源：https://stackoverflow.com/questions/38218744/parse-arbitrary-delimiter-character-using-antlr4

标签

regex

antlr4