Antlr4 is printing 'Extraneous input' error even with expected input

Submitted by 穿精又带淫゛_ on 2019-12-13 05:39:15

Question


I'm trying to parse SMILES strings using the OpenSMILES specification.

The grammar:

grammar SMILES;

atom: bracket_atom | aliphatic_organic | aromatic_organic | '*';

aliphatic_organic: 'B' | 'C' | 'N' | 'O' | 'S' | 'P' | 'F' | 'Cl' | 'Br' | 'I';
aromatic_organic: 'b' | 'c' | 'n' | 'o' | 's' | 'p';

bracket_atom: '[' isotope? symbol chiral? hcount? charge? atom_class? ']';
symbol: element_symbols | aromatic_symbols | '*';
isotope: NUMBER;
element_symbols: UPPER_CASE_CHAR LOWER_CASE_CHAR?;
aromatic_symbols: 'c' | 'n' | 'o' | 'p' | 's' | 'se' | 'as';

chiral: '@'
        |  '@@'
        |  '@TH1' | '@TH2'
        |  '@AL1' | '@AL2'
        |  '@SP1' | '@SP2' | '@SP3'
        |  '@TB1' | '@TB2' | '@TB3' | DOT DOT DOT | '@TB29' | '@TB30'
        |  '@OH1' | '@OH2' | '@OH3' | DOT DOT DOT | '@OH29' | '@OH30';

hcount: 'H' |  'H' DIGIT;

charge: '-'
        |  '-' DIGIT
        |  '+'
        |  '+' DIGIT
        |  '--'
        |  '++';

atom_class:':' NUMBER;

bond: '-' | '=' | '#' | '$' | ':' | '/' | '\\';
ringbond: (bond? DIGIT |  bond? '%' DIGIT DIGIT);
branched_atom: atom ringbond* branch*?;
branch: '(' chain ')' |  '(' bond chain ')' |  '(' dot chain ')';
chain: branched_atom
    |  chain branched_atom
    |  chain bond branched_atom
    |  chain dot branched_atom;
dot: '.';

DOT: .;
DIGIT: [0-9];
NUMBER: DIGIT+;
UPPER_CASE_CHAR: [A-Z];
LOWER_CASE_CHAR: [a-z];

ONE_TO_NINE: [1-9];

smiles: chain;

WS: [ \t\n\r]+ -> skip ;

When trying to parse the following using AntlrWorks2's TestRig:

CCc(c1)ccc2[n+]1ccc3c2Nc4c3cccc4

The following error(s) are printed (shortened for brevity):

line 1:5 extraneous input '1' expecting {'*', '[', 'N', 'O', 'I', 'S', '%', ')',..., DIGIT}
...
line 1:31 extraneous input '4' expecting {<EOF>, '*', '[', 'N', 'O',..., DIGIT}

This happens for every digit that is encountered in the string.

EDIT 1

After fixing the DOT rule, as suggested by @Lucas Trzesniewski, the extraneous input error has disappeared. However, a new error is present now when testing a different SMILES string.

For example, testing:

[Cu+2].[O-]S(=O)(=O)[O-]

Produces the error:

line 1:1 no viable alternative at input 'C'

EDIT 2

Problem from EDIT 1 was due to my element_symbols rule. Using the literal symbol strings seems to have solved it.

element_symbols: 'H' | 'He' | 'Li' | 'Be' | 'B' | 'C' | 'N' | 'O' | 'F' | 'Ne' | //...and so on
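
A rough way to see why the literal strings help is to model how the lexer chooses a token. This is only a Python sketch, not ANTLR itself. It assumes the usual ANTLR4 behaviour that the longest match wins, that ties go to the rule defined first, and that literals used in parser rules of a combined grammar act as tokens defined ahead of the explicit lexer rules. The rule lists are cut down to the few entries needed for 'Cu', and the tokenize helper is hypothetical.

```python
import re

def tokenize(rules, text):
    """Return the token names a longest-match, first-rule-wins lexer would emit."""
    names, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.match(pattern, text[pos:])
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        names.append(best_name)
        pos += best_len
    return names

# Before EDIT 2: 'C' exists as a literal (from aliphatic_organic), so it
# shadows UPPER_CASE_CHAR for the same single character.
before = [
    ("'C'", r"C"),
    ("UPPER_CASE_CHAR", r"[A-Z]"),
    ("LOWER_CASE_CHAR", r"[a-z]"),
]

# After EDIT 2: 'Cu' is a literal too, and it is the longer match.
after = [
    ("'C'", r"C"),
    ("'Cu'", r"Cu"),
    ("UPPER_CASE_CHAR", r"[A-Z]"),
    ("LOWER_CASE_CHAR", r"[a-z]"),
]

print(tokenize(before, "Cu"))  # ["'C'", 'LOWER_CASE_CHAR']
print(tokenize(after, "Cu"))   # ["'Cu'"]
```

Under this model, the original element_symbols rule never sees an UPPER_CASE_CHAR token for 'C', which is consistent with the "no viable alternative at input 'C'" message.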

Answer 1:


Your lexer rules are wrong.

First error:

DOT: .;

This is a catch-all: an unquoted . in a lexer rule matches any single character. What you really mean is:

DOT: '.';
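
To make the effect concrete, here is a small Python sketch of the token-selection rule. It is a model rather than ANTLR's actual implementation, and it assumes that the longest match wins and that ties go to the rule listed first. The first_token helper is hypothetical, and only the three relevant lexer rules are included.

```python
import re

def first_token(rules, text):
    """Name of the rule chosen at the start of text: longest match, then first rule."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strict '>' means an earlier rule keeps the win on equal length.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name

# Rule order as in the question's grammar, with DOT as a catch-all.
broken = [("DOT", r"."), ("DIGIT", r"[0-9]"), ("NUMBER", r"[0-9]+")]

# Same order, with DOT restricted to a literal period.
fixed = [("DOT", r"\."), ("DIGIT", r"[0-9]"), ("NUMBER", r"[0-9]+")]

print(first_token(broken, "1"))  # DOT
print(first_token(fixed, "1"))   # DIGIT
print(first_token(fixed, "."))   # DOT
```

With the catch-all in place, a single digit ties with DIGIT on length and DOT wins because it comes first, so the parser receives a DOT token where it expected a DIGIT.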

Second error: You're getting confused with the following rules:

DIGIT: [0-9];
NUMBER: DIGIT+;
ONE_TO_NINE: [1-9];

ONE_TO_NINE will never match anything, because every character it accepts is also accepted by DIGIT, and DIGIT appears first. Since the ONE_TO_NINE rule is also never used in the parser rules, you should simply remove it.

Then, things like DIGIT DIGIT in your parser rules won't match either if you're expecting a 2-digit number: the lexer takes the longest match, so you'll get a single NUMBER token there unless you separate the digits with whitespace. (I don't know what you really mean there, so perhaps it's not an error.)
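
The digit behaviour can be checked with a quick Python sketch of the same selection rule. Again, this is a model under the assumption that the longest match wins and ties go to the first rule, not ANTLR's real lexer. The tokenize helper is hypothetical, and the rule list assumes DOT has already been fixed.

```python
import re

# Rules in grammar order; WS is skipped, as in the grammar.
RULES = [
    ("DIGIT", r"[0-9]"),
    ("NUMBER", r"[0-9]+"),
    ("ONE_TO_NINE", r"[1-9]"),
    ("WS", r"[ \t\n\r]+"),
]

def tokenize(text):
    """Emit (rule, lexeme) pairs: longest match wins, ties go to the first rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mirror '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("12"))   # [('NUMBER', '12')]
print(tokenize("1 2"))  # [('DIGIT', '1'), ('DIGIT', '2')]
print(tokenize("7"))    # [('DIGIT', '7')]
```

In this model no input ever produces ONE_TO_NINE, and two adjacent digits always arrive as one NUMBER, so a parser rule written as DIGIT DIGIT only matches when the digits are separated.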



Source: https://stackoverflow.com/questions/26206148/antlr4-is-printing-extraneous-input-error-even-with-expected-input
