问题
I am writing a little parser for expressions. At the moment I just want it to recognize binary multiplications (myId * myId) and C-like dereferenced pointers (*myId), plus some assignation statements (myId *= myId).
The input that makes the parser throw errors is:
x *= y;
... on which the parser fails with this message and parse tree:
[line 1:1 mismatched input ' *' expecting {';', NEWLINE}]
(sourceFile (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression x)))) * = ) (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression y)))) ;) <EOF>)
I've been scratching my head for a while but I can't see what is wrong in my grammar (see it below). Any hints, please? Thanks in advance.
grammar Sable;
options {}
@header {
package org.sable.parser;
}
ASSIGNMENT_OP:
'='
;
BINARY_OP:
'*'
;
WS_BUT_NOT_NEWLINE:
WhiteSpaceButNotNewLineCharacter
;
NEWLINE:
('\u000D' '\u000A')
| '\u000A'
;
WSA_BINARY_OP:
(WS_BUT_NOT_NEWLINE+ BINARY_OP WS_BUT_NOT_NEWLINE+)
| BINARY_OP
;
WSA_PREFIX_OP:
(WS_BUT_NOT_NEWLINE+ '*' )
;
WS : WhiteSpaceCharacter+ -> skip
;
IDENTIFIER:
(IdentifierHead IdentifierCharacter*)
| ('`'(IdentifierHead IdentifierCharacter*)'`')
;
// NOTE: a file with zero statements is allowed because
// it can contain just comments.
sourceFile:
statement* EOF;
statement:
expressionStatement (';' | NEWLINE);
// Req. not existing any valid expression starting from
// an equals sign or any other assignment operator.
expressionStatement:
expression (assignmentOperator expression)?;
expression:
monoOperatedExpression (binaryOperator monoOperatedExpression)?
;
monoOperatedExpression:
atomicExpression
;
binaryOperator:
WSA_BINARY_OP
;
atomicExpression:
IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
;
type:
IDENTIFIER
;
assignmentOperator:
ASSIGNMENT_OP
;
fragment DecimalDigit:
'0'..'9'
;
fragment IdentifierHead:
'a'..'z'
| 'A'..'Z'
;
fragment IdentifierCharacter:
DecimalDigit
| IdentifierHead
;
fragment WhiteSpaceCharacter:
WhiteSpaceButNotNewLineCharacter
| NewLineCharacter;
fragment WhiteSpaceButNotNewLineCharacter:
[\u0020\u000C\u0009u000B\u000C]
;
fragment NewLineCharacter:
[\u000A\u000D]
;
EDIT: adding a new version of the grammar on request of commenters.
grammar Sable;
options {}
@header {
package org.sable.parser;
}
//
// PARSER RULES.
sourceFile : statement* EOF;
statement : expressionStatement (SEMICOLON | NEWLINE);
expressionStatement : expression (ASSIGNMENT_OPERATOR expression)?;
expression:
expression WSA_OPERATOR expression
| expression OPERATOR expression
| OPERATOR expression
| expression OPERATOR
| atomicExpression
;
atomicExpression:
IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
;
type : IDENTIFIER;
//
// LEXER RULES.
COMMENT : '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT : '//' ~[\000A\000D]* -> channel(HIDDEN);
ASSIGNMENT_OPERATOR : Operator? '=';
// WSA = White Space Aware token.
// These are tokens that occurr in a given whitespace context.
WSA_OPERATOR:
(WhiteSpaceNotNewline+ Operator WhiteSpaceNotNewline+)
;
OPERATOR : Operator;
// Newline chars are defined apart because they carry meaning as a statement
// delimiter.
NEWLINE:
('\u000D' '\u000A')
| '\u000A'
;
WS : WhiteSpaceNotNewline -> skip;
SEMICOLON : ';';
IDENTIFIER:
(IdentifierHead IdentifierCharacter*)
| ('`'(IdentifierHead IdentifierCharacter*)'`')
;
fragment DecimalDigit :'0'..'9';
fragment IdentifierHead:
'a'..'z'
| 'A'..'Z'
| '_'
| '\u00A8'
| '\u00AA'
| '\u00AD'
| '\u00AF' |
'\u00B2'..'\u00B5' |
'\u00B7'..'\u00BA' |
'\u00BC'..'\u00BE' |
'\u00C0'..'\u00D6' |
'\u00D8'..'\u00F6' |
'\u00F8'..'\u00FF' |
'\u0100'..'\u02FF' |
'\u0370'..'\u167F' |
'\u1681'..'\u180D' |
'\u180F'..'\u1DBF' |
'\u1E00'..'\u1FFF' |
'\u200B'..'\u200D' |
'\u202A'..'\u202E' |
'\u203F'..'\u2040' |
'\u2054' |
'\u2060'..'\u206F' |
'\u2070'..'\u20CF' |
'\u2100'..'\u218F' |
'\u2460'..'\u24FF' |
'\u2776'..'\u2793' |
'\u2C00'..'\u2DFF' |
'\u2E80'..'\u2FFF' |
'\u3004'..'\u3007' |
'\u3021'..'\u302F' |
'\u3031'..'\u303F' |
'\u3040'..'\uD7FF' |
'\uF900'..'\uFD3D' |
'\uFD40'..'\uFDCF' |
'\uFDF0'..'\uFE1F' |
'\uFE30'..'\uFE44' |
'\uFE47'..'\uFFFD'
;
fragment IdentifierCharacter:
DecimalDigit
| '\u0300'..'\u036F'
| '\u1DC0'..'\u1DFF'
| '\u20D0'..'\u20FF'
| '\uFE20'..'\uFE2F'
| IdentifierHead
;
// Non-newline whitespaces are defined apart because they carry meaning in
// certain contexts, e.g. within space-aware operators.
fragment WhiteSpaceNotNewline : [\u0020\u000C\u0009u000B\u000C];
fragment Operator:
'*'
| '/'
| '%'
| '+'
| '-'
| '<<'
| '>>'
| '&'
| '^'
| '|'
;
回答1:
The rule
expression
: monoOperatedExpression (binaryOperator monoOperatedExpression)?
;
does not permit an =
after the binaryOperator
. Accordingly, the runtime reports that it did not know what next rule to use following the consumption of the BINARY_OP
.
The grammar can be fixed with some significant restructuring and, preferably, simplification.
1 - Whitespace/newline handling can be greatly simplified by ignoring it.
WS : [ \t\r\n] -> skip;
C-family and Python-like languages are context free languages with a few, well-known context sensitive corner cases. ANTLR is a context free parser with a number of convenience capabilities to handle context sensitivities. So, ignoring (or hiding) whitespace should be the default.
2 - disambiguate the use of *
by definition:
STAR_EQUAL : '*=' ;
STAR : '*' ;
EQUAL : '=' ;
This ensures that any single STAR
is available to be considered only as a pointer mark or multiplication operator (the sequence STAR WS EQUAL is either invalid in your language or could have some custom meaning).
3 - use parser rule recursion: consider the expression handling rules in the C grammar, specifically starting with the expression
rule. The simplified pattern is:
expression // list of all valid syntaxes for an `expression`
: LPAREN expression RPAREN
| expression ( COMMA expression )*
| expression op expression
| unitary_op expression
| expression unitary_op
| << any other valid syntax >>
| atom
;
unitary_op : 2PLUS | 2DASH | .... ;
op : STAR_EQUAL | STAR | EQUAL | .... ;
atom
: STAR? IDENTIFIER // pointer usage
| NUMBER
;
Presented this way, the grammar will be far more readable and maintainable.
With these changes, completing the revision of the grammar is left as an easy exercise for the OP (meaning, try it and post any problems encountered).
Bonus - ANTLR is a top down parser. So, put the parser rules at the top, organized broad to narrow. Followed by the lexer rules, also ordered in the same way, any lexer modes, and then with fragment rules at the very bottom.
This ordering ease the cognitive load of understanding the grammar by you and others. For example, makes the tree dump easier/quicker to understand. Will also ease the task of eventually dividing into a split grammar (recommended if the grammar is of any significant complexity and required if there are modes).
Full Grammar
grammar Sable;
@header {
package org.sable.parser.gen;
}
sable
: statement* EOF
;
statement
: expression? SEMI
;
expression
: LPAREN expression RPAREN
| COMMA expression
| expression op expression
| unitary_op expression
| expression unitary_op
| STAR? IDENTIFIER
| NUMBER
;
unitary_op
: DPLUS | DMINUS
;
op : STAR_EQUAL | DIV_EQUAL | PLUS_EQUAL | MINUS_EQUAL | EQUAL
| STAR | DIV | PLUS | MINUS
;
COMMENT : Comment -> skip ;
STAR_EQUAL : '*=' ;
DIV_EQUAL : '/=' ;
PLUS_EQUAL : '+=' ;
MINUS_EQUAL : '-=' ;
EQUAL : '=' ;
STAR : '*' ; // mult or pointer
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
DPLUS : '++' ;
DMINUS : '--' ;
COMMA : ',' ;
DOT : '.' ;
SEMI : ';' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
LBRACK : '[' ;
RBRACK : ']' ;
LANGLE : '<' ;
RANGLE : '>' ;
NUMBER : [0-9]+ ('.' [0-9]+)? ([eE] [+-]? [0-9]+)? ;
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_-]* ;
WS : [ \t\r\n]+ -> skip;
ERRCHAR
: . -> channel(HIDDEN)
;
fragment Comment
: '/*' .*? '*/'
| '//' ~[\r\n]*
;
Generates but is untested. Report back any corner cases not handled.
来源:https://stackoverflow.com/questions/46353802/why-is-antlr4-parsing-my-sentence-into-two-statements