Hello everyone,
This could be a follow-up on this question: Antlr rule priorities
I'm trying to write an ANTLR grammar for t
Robin wrote:
I thought that writing rules for inline markup text would be easy
I must admit that I am not familiar with this markup language, but it seems to resemble BB-Code or Wiki markup, which are not easily translated into an (ANTLR) grammar! These languages can't easily be tokenized, since the meaning of a character depends on where it occurs. White space sometimes has a special meaning (as with definition lists). So no, it's not at all easy, IMO. If this is just an exercise to get acquainted with ANTLR (or parser generators in general), I highly recommend choosing something else to parse.
Robin wrote:
Could someone point to my errors and maybe give me a hint on how to match regular text?
You must first realize that ANTLR creates a lexer (tokenizer) and a parser. Lexer rules start with an upper-case letter and parser rules start with a lower-case letter. A parser can only operate on tokens (the objects produced by lexer rules). To keep things orderly, you should not use token literals inside parser rules (see rule q in the grammar below). Also, the ~ (negation) meta character has a different meaning depending on where it's used (in a parser rule or in a lexer rule).
Take the following grammar:
p : T;
q : ~'z';
T : ~'x';
U : 'y';
ANTLR will first "move" the 'z' literal to a lexer rule like this:
p : T;
q : ~RANDOM_NAME;
T : ~'x';
U : 'y';
RANDOM_NAME : 'z';
(ANTLR does not really use the name RANDOM_NAME, but that doesn't matter). Now, the parser rule q does not match "any character other than 'z'", as you might expect! A negation inside a parser rule negates a token (or lexer rule). So ~RANDOM_NAME will match either a token created by lexer rule T or one created by lexer rule U.
Inside lexer rules, ~ negates (single!) characters. So the lexer rule T will match any character in the range \u0000..\uFFFF except 'x'. Note that ~'ab' is invalid inside a lexer rule: you can only negate single-character sets.
So, all these ~'???' inside your parser rules are wrong (wrong as in: they don't behave as you expect them to).
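To make the difference concrete, here is a small plain-Java sketch (it does not use ANTLR, and the class and method names are made up for illustration). It only models the two meanings of ~ for the small grammar above: inside a lexer rule the complement is taken over characters, inside a parser rule it is taken over token types.

```java
import java.util.EnumSet;
import java.util.Set;

final class NegationDemo {
    // The three lexer rules (token types) of the small grammar above.
    enum TokenType { T, U, RANDOM_NAME }

    // Lexer-level negation, e.g. ~'x': accepts any single character except the excluded one.
    static boolean lexerNot(char excluded, char actual) {
        return actual != excluded;
    }

    // Parser-level negation, e.g. ~RANDOM_NAME: accepts any single token
    // whose type is not the excluded one.
    static Set<TokenType> parserNot(TokenType excluded) {
        return EnumSet.complementOf(EnumSet.of(excluded));
    }

    public static void main(String[] args) {
        System.out.println(lexerNot('x', 'y'));               // true
        System.out.println(lexerNot('x', 'x'));               // false
        System.out.println(parserNot(TokenType.RANDOM_NAME)); // [T, U]
    }
}
```

Note that the parser-level version never looks at characters at all: it only sees which lexer rule produced the token.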
Robin wrote:
Is there a way to set priority on the grammar rules? Maybe this could be a lead.
Yes, the order is top to bottom in both lexer and parser rules (where the top has the highest priority). Let's say parse is the entry point of your grammar:
parse
: p
| q
;
then p will be tried first and, if that fails, q is tried.
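As a rough mental model only, ordered alternatives behave like the plain-Java sketch below. The predicates standing in for p and q are hypothetical, and this is not the code ANTLR generates: as far as I understand it, a real ANTLR 3 parser first uses lookahead to choose an alternative and only falls back to trying them in order where it has to (for instance with backtrack=true).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

final class OrderedAlternativesDemo {
    // Alternatives of the rule `parse : p | q ;` in declaration order.
    // LinkedHashMap keeps insertion order, which models "top has the highest priority".
    private static final Map<String, Predicate<String>> ALTERNATIVES = new LinkedHashMap<>();
    static {
        ALTERNATIVES.put("p", in -> in.startsWith("a")); // hypothetical: p accepts input starting with 'a'
        ALTERNATIVES.put("q", in -> !in.isEmpty());      // hypothetical: q accepts any non-empty input
    }

    // Returns the name of the first alternative that accepts the input.
    static String parse(String input) {
        for (Map.Entry<String, Predicate<String>> alt : ALTERNATIVES.entrySet()) {
            if (alt.getValue().test(input)) {
                return alt.getKey();
            }
        }
        throw new IllegalArgumentException("no viable alternative for: " + input);
    }

    public static void main(String[] args) {
        System.out.println(parse("abc")); // p (q would also accept it, but p comes first)
        System.out.println(parse("xyz")); // q
    }
}
```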
As for lexer rules, the rules that match keywords, for example, should be placed before a rule that could possibly match those same keywords:
// first keywords:
WHILE : 'while';
IF : 'if';
ELSE : 'else';
// and only then, the identifier rule:
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
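The effect of that ordering can be imitated with a hand-rolled tokenizer in plain Java. This is only a sketch under my own assumptions, not what ANTLR generates: I assume the longest match wins and that a tie is decided in favour of the rule listed first. The WS rule and the regular expressions are additions for the sake of the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RuleOrderDemo {
    // Rules in declaration order; LinkedHashMap preserves that order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("WHILE", Pattern.compile("while"));
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("ELSE", Pattern.compile("else"));
        RULES.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*"));
        RULES.put("WS", Pattern.compile("\\s+")); // hypothetical rule, skipped in the output
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the rule declared earlier is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestRule.equals("WS")) {
                tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [WHILE:while, ID:whilex, ID:if_, ELSE:else]
        System.out.println(tokenize("while whilex if_ else"));
    }
}
```

If you move ID to the top of the rule list in this sketch, "while" comes out as an ID, which is exactly why the keyword rules go first.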
Here's a quick demo of how you could parse this reStructuredText. Note that it handles only a small subset of all the available markup syntax, and adding more to it will affect the existing parser and lexer rules: so there is much, much more work to be done!
grammar RST;
options {
output=AST;
backtrack=true;
memoize=true;
}
tokens {
ROOT;
PARAGRAPH;
INDENTATION;
LINE;
WORD;
BOLD;
ITALIC;
INTERPRETED_TEXT;
INLINE_LITERAL;
REFERENCE;
}
parse
: paragraph+ EOF -> ^(ROOT paragraph+)
;
paragraph
: line+ -> ^(PARAGRAPH line+)
| Space* LineBreak -> /* omit line-breaks between paragraphs from AST */
;
line
: indentation text+ LineBreak -> ^(LINE text+)
;
indentation
: Space* -> ^(INDENTATION Space*)
;
text
: styledText
| interpretedText
| inlineLiteral
| reference
| Space
| Star
| EscapeSequence
| Any
;
styledText
: bold
| italic
;
bold
: Star Star boldAtom+ Star Star -> ^(BOLD boldAtom+)
;
italic
: Star italicAtom+ Star -> ^(ITALIC italicAtom+)
;
boldAtom
: ~(Star | LineBreak)
| italic
;
italicAtom
: ~(Star | LineBreak)
| bold
;
interpretedText
: BackTick interpretedTextAtoms BackTick -> ^(INTERPRETED_TEXT interpretedTextAtoms)
;
interpretedTextAtoms
: ~BackTick+
;
inlineLiteral
: BackTick BackTick inlineLiteralAtoms BackTick BackTick -> ^(INLINE_LITERAL inlineLiteralAtoms)
;
inlineLiteralAtoms
: inlineLiteralAtom+
;
inlineLiteralAtom
: ~BackTick
| BackTick ~BackTick
;
reference
: Any+ UnderScore -> ^(REFERENCE Any+)
;
UnderScore
: '_'
;
BackTick
: '`'
;
Star
: '*'
;
Space
: ' '
| '\t'
;
EscapeSequence
: '\\' ('\\' | '*')
;
LineBreak
: '\r'? '\n'
| '\r'
;
Any
: .
;
When you generate a parser and lexer from the above, and let it parse the following input file:
***x*** **yyy** *zz* *
a b c

P2 ``*a*`b`` `q`
Python_
(note the trailing line break!)
the parser will produce the following AST:
The graph can be created by running this class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source =
                "***x*** **yyy** *zz* *\n" +
                "a b c\n" +
                "\n" +
                "P2 ``*a*`b`` `q`\n" +
                "Python_\n";
        RSTLexer lexer = new RSTLexer(new ANTLRStringStream(source));
        RSTParser parser = new RSTParser(new CommonTokenStream(lexer));
        CommonTree tree = (CommonTree)parser.parse().getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}
or if your source comes from a file, do:
RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst"));
or
RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst", "???"));
where "???" is the encoding of your file.
The class above will print the AST as a DOT file to the console. You can use a DOT viewer to display the AST. In this case, I posted an image created by kgraphviewer, but there are many more viewers around. A nice online one is this one, which appears to use kgraphviewer under the hood. Good luck!