ANTLR Nested Functions

Question

Is ANTLR right for this project?

I'm looking to process and transform a string entered by a user, which may include custom functions. For example, the user might write something like $CAPITALIZE('word') in a string, and I want to perform the actual transformation in the background using StringUtils.

I would imagine the users will sometimes write nested functions like:

$RIGHT_PAD($RIGHT($CAPITALIZE('a123456789'),6),3,'0')

Where the expected output would be a string value of '456789000' (going by the $RIGHT example below, $RIGHT keeps the last 6 characters).

I tried using regex to split the functions apart, but once they were nested it was no longer easy. I figured I might try writing my own parser, and while doing research I came across an article that suggested using ANTLR instead.

Is this something ANTLR would be right for? If so, are there any similar examples already available for me to look at? Or would someone be kind enough to give me an example of how I might write this in ANTLR, so that my custom functions can be processed both individually and when nested?

Functions:

$CAPITALIZE(String str)
$INDEX_OF(String seq, String searchSeq)
$LEFT(String str, int len)
$LEFT_PAD(String str, int size, char padChar)
$LOWERCASE(String str)
$RIGHT(String str, int len)
$RIGHT_PAD(String str, int size, char padChar)
$STRIP(String str)
$STRIP_ACCENTS(String input)
$SUBSTRING(String str, int start)
$SUBSTRING(String str, int start, int end)
$TRIM(String str)
$TRUNCATE(String str, int maxWidth)
$UPPERCASE(String str)

Basic Examples:

$CAPITALIZE('word') → 'Word'
$INDEX_OF('word', 'r') → 2
$LEFT('0123456789',6) → '012345'
$LEFT_PAD('0123456789',3, '0') → '0000123456789'
$LOWERCASE('WoRd') → 'word'
$RIGHT('0123456789',6) → '456789'
$RIGHT_PAD('0123456789',3, '0') → '0123456789000'
$STRIP(' word ') → 'word'
$STRIP_ACCENTS('wórd') → 'word'
$SUBSTRING('word', 1) → 'ord'
$SUBSTRING('word', 0, 2) → 'wor'
$TRIM('word ') → 'word'
$TRUNCATE('more words', 3) → 'more'
$UPPERCASE('word') → 'WORD'

Nested Examples:

$LEFT_PAD($LEFT('123456789',6),3,'0') → '000123456'
$RIGHT_PAD($RIGHT($CAPITALIZE('a123456789'),6),3,'0') → '456789000'

Actual Example: By "actual example" I mean what I expect a real string value to look like. You will notice that there are variables written like ${var}. These variables will be replaced with actual string values using Apache Commons StringSubstitutor before the string is passed to ANTLR (if it turns out I should use it).

Initial String Written By User

\HomeDir\Students\$RIGHT(${graduation.year},2)\$LEFT_PAD($LEFT(${state.id},6),3,'0')

String After Being Processed By StringSubstitutor

\HomeDir\Students\$RIGHT('2020',2)\$LEFT_PAD($LEFT('123456789',6),3,'0')

String After Being Processed By ANTLR (And my final output)

\HomeDir\Students\20\000123456

Does ANTLR seem like something I should use for this project, or would something else be better suited?

Answer 1:

Yes, ANTLR would be a good choice. Keep in mind that ANTLR only does the parsing for you, and provides you with a mechanism to traverse the generated parse tree. You will have to write code to evaluate the expressions.
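To give a rough idea of what "writing code to evaluate the expressions" can look like, here is a minimal sketch in plain Java. The Node, Literal and Call classes are hypothetical stand-ins for a parse tree, not ANTLR's generated classes, and the functions are implemented with plain string operations rather than StringUtils. It assumes that padding adds `size` characters, as the question's examples suggest; as far as I know, StringUtils.leftPad instead pads up to a total width, so check which behaviour you actually want.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical tree model and evaluator for illustration only;
// ANTLR generates its own parse-tree classes, which are not used here.
class EvalSketch {

    interface Node {}

    // A quoted string or an integer argument.
    static final class Literal implements Node {
        final Object value;
        Literal(Object value) { this.value = value; }
    }

    // A function call such as $LEFT(str, len), holding unevaluated arguments.
    static final class Call implements Node {
        final String name;
        final List<Node> args;
        Call(String name, Node... args) {
            this.name = name;
            this.args = Arrays.asList(args);
        }
    }

    // Evaluates the arguments first (so innermost calls run first), then the call itself.
    static Object eval(Node node) {
        if (node instanceof Literal) {
            return ((Literal) node).value;
        }
        Call call = (Call) node;
        List<Object> a = new ArrayList<>();
        for (Node arg : call.args) {
            a.add(eval(arg));
        }
        switch (call.name) {
            case "CAPITALIZE": return capitalize((String) a.get(0));
            case "LEFT":       return left((String) a.get(0), (Integer) a.get(1));
            case "RIGHT":      return right((String) a.get(0), (Integer) a.get(1));
            // Assumption: padding ADDS `size` characters, as in the question's examples.
            case "LEFT_PAD":   return repeat((String) a.get(2), (Integer) a.get(1)) + a.get(0);
            case "RIGHT_PAD":  return a.get(0) + repeat((String) a.get(2), (Integer) a.get(1));
            default: throw new IllegalArgumentException("unknown function: " + call.name);
        }
    }

    static String capitalize(String s) {
        return s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    static String left(String s, int len) {
        return s.substring(0, Math.min(len, s.length()));
    }

    static String right(String s, int len) {
        return s.substring(Math.max(0, s.length() - len));
    }

    static String repeat(String pad, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(pad);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Tree for $LEFT_PAD($LEFT('123456789',6),3,'0')
        Node tree = new Call("LEFT_PAD",
                new Call("LEFT", new Literal("123456789"), new Literal(6)),
                new Literal(3), new Literal("0"));
        System.out.println(eval(tree)); // prints 000123456
    }
}
```

With ANTLR, the same recursion would typically live in a visitor or listener over the generated parse tree; only the way the tree is obtained changes.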

In your case, the lexer needs to switch into an "in-a-function" mode whenever it encounters a '$', by pushing that mode onto its mode stack. When it sees a ')', one such "in-a-function" mode should be popped off the stack again.

Read all about lexical modes and the mode stack in the ANTLR documentation: https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md
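To make the push/pop idea concrete, here is a small hand-written tokenizer in plain Java that keeps its own mode stack. It only illustrates the mechanism and is not what ANTLR generates: the class name, the token labels and the string handling (no '' escapes) are simplifying assumptions of mine.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hand-written illustration of a lexer mode stack; not ANTLR-generated code.
class ModeStackDemo {

    static boolean isIdChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns token descriptions such as "ID:LEFT" or "TEXT:a".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<String> modes = new ArrayDeque<>(); // empty stack = default (text) mode
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '$') {                       // '$' pushes function mode, in either mode
                modes.push("IN_FUNCTION");
                i++;
            } else if (modes.isEmpty()) {         // default mode: any other character is TEXT
                tokens.add("TEXT:" + c);
                i++;
            } else if (c == '(') {
                tokens.add("PAR_OPEN");
                i++;
            } else if (c == ')') {                // ')' pops one level of function mode
                tokens.add("PAR_CLOSE");
                modes.pop();
                i++;
            } else if (c == ',') {
                tokens.add("COMMA");
                i++;
            } else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++;                              // whitespace is skipped inside a function
            } else if (c == '\'') {
                int end = input.indexOf('\'', i + 1); // simplification: no '' escapes
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated string at index " + i);
                }
                tokens.add("STRING:" + input.substring(i, end + 1));
                i = end + 1;
            } else if (isDigit(c)) {
                int start = i;
                while (i < input.length() && isDigit(input.charAt(i))) i++;
                tokens.add("NUMBER:" + input.substring(start, i));
            } else if (isIdChar(c)) {
                int start = i;
                while (i < input.length() && isIdChar(input.charAt(i))) i++;
                tokens.add("ID:" + input.substring(start, i));
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a$LEFT($TRIM('x '),1)b"));
    }
}
```

Note how the 'b' after the last ')' is treated as plain text again: by then both function modes have been popped and the stack is empty.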

Here's a quick demo of how that could work for ANTLR4 (ANTLR3 doesn't support lexical modes):

file: TLexer.g4

lexer grammar TLexer;

TEXT
 : ~[$]
 ;

FUNCTION_START
 : '$' -> pushMode(IN_FUNCTION), skip
 ;

mode IN_FUNCTION;
  FUNCTION_NESTED : '$' -> pushMode(IN_FUNCTION), skip;
  ID              : [a-zA-Z_]+;
  PAR_OPEN        : '(';
  PAR_CLOSE       : ')' -> popMode;
  NUMBER          : [0-9]+;
  STRING          : '\'' ( ~'\'' | '\'\'' )* '\'';
  COMMA           : ',';
  SPACE           : [ \t\r\n] -> skip;

file: TParser.g4

parser grammar TParser;

options {
  tokenVocab=TLexer;
}

parse
 : atom* EOF
 ;

atom
 : text
 | function
 ;

text
 : TEXT+
 ;

function
 : ID params
 ;

params
 : PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE
 ;

param
 : NUMBER
 | STRING
 | function
 ;

With the ANTLR4 plugin for IntelliJ, you can easily test the parser's parse rule by feeding it the following input: foo $RIGHT_PAD($RIGHT($CAPITALIZE('a123456789'), 6), 3, '0') bar. The resulting parse tree shows the three function calls nested inside one another, with the surrounding text as separate text nodes.
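If you do not have the plugin at hand, the shape of that tree can be approximated with a hand-written recursive-descent sketch in plain Java that follows the rules of TParser.g4 one method per rule. This is not what ANTLR generates: the class name and the indented text output are my own choices, the params rule is not printed as a separate node, and the '' escape inside strings is ignored.

```java
// Hand-written recursive-descent sketch mirroring the rules of TParser.g4.
// It is NOT ANTLR output, and it ignores the '' escape inside strings.
class TreeSketch {
    private final String in;
    private int pos = 0;
    private final StringBuilder out = new StringBuilder();

    private TreeSketch(String in) { this.in = in; }

    // Returns an indented, one-node-per-line rendering of the input's structure.
    static String tree(String input) {
        TreeSketch p = new TreeSketch(input);
        p.parse();
        return p.out.toString();
    }

    // parse : atom* EOF ;   atom : text | function ;
    private void parse() {
        line(0, "parse");
        while (pos < in.length()) {
            if (in.charAt(pos) == '$') { pos++; function(1); } else { text(1); }
        }
    }

    // text : TEXT+ ;
    private void text(int depth) {
        int start = pos;
        while (pos < in.length() && in.charAt(pos) != '$') pos++;
        line(depth, "text \"" + in.substring(start, pos) + "\"");
    }

    // function : ID params ;
    private void function(int depth) {
        int start = pos;
        while (isIdChar(peek())) pos++;
        if (pos == start) throw error("function name");
        line(depth, "function " + in.substring(start, pos));
        params(depth + 1);
    }

    // params : PAR_OPEN ( param ( COMMA param )* )? PAR_CLOSE ;
    private void params(int depth) {
        expect('(');
        skipSpaces();
        if (peek() != ')') {
            param(depth);
            skipSpaces();
            while (peek() == ',') { pos++; param(depth); skipSpaces(); }
        }
        expect(')');
    }

    // param : NUMBER | STRING | function ;
    private void param(int depth) {
        skipSpaces();
        char c = peek();
        if (c == '$') {
            pos++;
            function(depth);
        } else if (c == '\'') {
            int end = in.indexOf('\'', pos + 1);
            if (end < 0) throw error("closing quote");
            line(depth, "STRING " + in.substring(pos, end + 1));
            pos = end + 1;
        } else if (c >= '0' && c <= '9') {
            int start = pos;
            while (peek() >= '0' && peek() <= '9') pos++;
            line(depth, "NUMBER " + in.substring(start, pos));
        } else {
            throw error("parameter");
        }
    }

    private static boolean isIdChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    private char peek() { return pos < in.length() ? in.charAt(pos) : '\0'; }

    private void skipSpaces() {
        while (peek() == ' ' || peek() == '\t' || peek() == '\r' || peek() == '\n') pos++;
    }

    private void expect(char c) {
        skipSpaces();
        if (peek() != c) throw error("'" + c + "'");
        pos++;
    }

    private void line(int depth, String s) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(s).append('\n');
    }

    private IllegalArgumentException error(String what) {
        return new IllegalArgumentException("expected " + what + " at index " + pos);
    }

    public static void main(String[] args) {
        System.out.print(tree("foo $RIGHT_PAD($RIGHT($CAPITALIZE('a123456789'), 6), 3, '0') bar"));
    }
}
```

For the input above this prints "parse" at the top, the two text nodes "foo " and " bar" at the first level, and the RIGHT_PAD, RIGHT and CAPITALIZE calls each indented one level deeper than the call that contains them.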

Source: https://stackoverflow.com/questions/51955458/antlr-nested-functions

Tags: java, antlr4, antlr3