Question
I'm trying to parse SMILES strings using the OpenSMILES specification.
The grammar:
grammar SMILES;
atom: bracket_atom | aliphatic_organic | aromatic_organic | '*';
aliphatic_organic: 'B' | 'C' | 'N' | 'O' | 'S' | 'P' | 'F' | 'Cl' | 'Br' | 'I';
aromatic_organic: 'b' | 'c' | 'n' | 'o' | 's' | 'p';
bracket_atom: '[' isotope? symbol chiral? hcount? charge? atom_class? ']';
symbol: element_symbols | aromatic_symbols | '*';
isotope: NUMBER;
element_symbols: UPPER_CASE_CHAR LOWER_CASE_CHAR?;
aromatic_symbols: 'c' | 'n' | 'o' | 'p' | 's' | 'se' | 'as';
chiral: '@'
| '@@'
| '@TH1' | '@TH2'
| '@AL1' | '@AL2'
| '@SP1' | '@SP2' | '@SP3'
| '@TB1' | '@TB2' | '@TB3' | DOT DOT DOT | '@TB29' | '@TB30'
| '@OH1' | '@OH2' | '@OH3' | DOT DOT DOT | '@OH29' | '@OH30';
hcount: 'H' | 'H' DIGIT;
charge: '-'
| '-' DIGIT
| '+'
| '+' DIGIT
| '--'
| '++';
atom_class:':' NUMBER;
bond: '-' | '=' | '#' | '$' | ':' | '/' | '\\';
ringbond: (bond? DIGIT | bond? '%' DIGIT DIGIT);
branched_atom: atom ringbond* branch*?;
branch: '(' chain ')' | '(' bond chain ')' | '(' dot chain ')';
chain: branched_atom
| chain branched_atom
| chain bond branched_atom
| chain dot branched_atom;
dot: '.';
DOT: .;
DIGIT: [0-9];
NUMBER: DIGIT+;
UPPER_CASE_CHAR: [A-Z];
LOWER_CASE_CHAR: [a-z];
ONE_TO_NINE: [1-9];
smiles: chain;
WS: [ \t\n\r]+ -> skip ;
When trying to parse the following using AntlrWorks2's TestRig:
CCc(c1)ccc2[n+]1ccc3c2Nc4c3cccc4
The following error(s) are printed (shortened for brevity):
line 1:5 extraneous input '1' expecting {'*', '[', 'N', 'O', 'I', 'S', '%', ')',..., DIGIT}
...
line 1:31 extraneous input '4' expecting {<EOF>, '*', '[', 'N', 'O',..., DIGIT}
This happens for every digit that is encountered in the string.
EDIT 1
After fixing the DOT rule, as suggested by @Lucas Trzesniewski, the extraneous input error has disappeared. However, a new error now appears when testing a different SMILES string.
For example, testing:
[Cu+2].[O-]S(=O)(=O)[O-]
Produces the error:
line 1:1 no viable alternative at input 'C'
EDIT 2
The problem from EDIT 1 was caused by my element_symbols rule. Replacing it with the literal symbol strings seems to have solved it:
element_symbols: 'H' | 'He' | 'Li' | 'Be' | 'B' | 'C' | 'N' | 'O' | 'F' | 'Ne' | //...and so on
Answer 1:
Your lexer rules are wrong.
First error:
DOT: .;
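This is a catch-all: an unquoted . in a lexer rule matches any single character. Because DOT is declared before DIGIT, every digit is tokenized as DOT, which is why the parser reports each one as extraneous input. What you really mean is: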
DOT: '.';
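To see why the unquoted . swallows the digits, here is a rough Python sketch of the selection policy ANTLR4's lexer follows: the longest match wins, and on a tie the rule declared first wins. It is a simplified model, not ANTLR itself; the names tokenize, LITERALS, BROKEN and FIXED are made up for this illustration, and only a handful of the grammar's rules are modelled.

```python
import re

def tokenize(rules, text):
    """Longest match wins; on a tie, the rule listed first wins."""
    compiled = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strictly greater, so an earlier rule keeps the win on a tie.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Literals used in parser rules (such as 'c') behave like rules declared first.
LITERALS = [("'c'", r"c")]
BROKEN = LITERALS + [("DOT", r"."), ("DIGIT", r"[0-9]"), ("NUMBER", r"[0-9]+")]
FIXED = LITERALS + [("DOT", r"\."), ("DIGIT", r"[0-9]"), ("NUMBER", r"[0-9]+")]

broken_tokens = tokenize(BROKEN, "c1")
fixed_tokens = tokenize(FIXED, "c1")
print(broken_tokens)  # [("'c'", 'c'), ('DOT', '1')]
print(fixed_tokens)   # [("'c'", 'c'), ('DIGIT', '1')]
```

With the catch-all rule the '1' reaches the parser as a DOT token where a DIGIT is expected; with the quoted literal it arrives as DIGIT.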
Second error: You're getting confused with the following rules:
DIGIT: [0-9];
NUMBER: DIGIT+;
ONE_TO_NINE: [1-9];
ONE_TO_NINE will never match anything, because everything it matches is also matched by DIGIT, and DIGIT appears first. Since the ONE_TO_NINE rule is never used anyway, you should simply remove it.
Also, DIGIT DIGIT in your parser rules won't match a two-digit number. The lexer prefers the longest match, so two adjacent digits arrive as a single NUMBER token unless you separate them with whitespace. (I don't know what you really intend there, so perhaps it's not an error.)
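The overlap between these three rules can be checked with a small Python sketch. It only models the lexer's choice for the first token (longest match, then declaration order) and is not ANTLR itself; RULES and first_token are hypothetical names used for this illustration.

```python
import re

# Lexer rules in the order the question declares them (DOT omitted).
RULES = [("DIGIT", r"[0-9]"), ("NUMBER", r"[0-9]+"), ("ONE_TO_NINE", r"[1-9]")]

def first_token(text):
    """Return (rule, lexeme) for the token at the start of text."""
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text)
        # A later rule only wins if it matches strictly more characters.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

single = first_token("7")
double = first_token("12")
print(single)  # ('DIGIT', '7')
print(double)  # ('NUMBER', '12')
```

A single digit always comes out as DIGIT, so ONE_TO_NINE is unreachable, and "12" comes out as one NUMBER token, so a parser rule such as '%' DIGIT DIGIT never sees two DIGIT tokens in a row.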
Source: https://stackoverflow.com/questions/26206148/antlr4-is-printing-extraneous-input-error-even-with-expected-input