ANTLR grammar for reStructuredText (rule priorities)


旧巷少年郎 2020-12-30 17:10

First question stream

Hello everyone,

This could be a follow-up on this question: Antlr rule priorities

I'm trying to write an ANTLR grammar for t

2 answers

悲哀的现实 (original poster)

2020-12-30 17:58
Robin wrote:

I thought that writing rules for inline markup text would be easy

I must admit that I am not familiar with this markup language, but it seems to resemble BB-Code or Wiki markup, neither of which translates easily into an (ANTLR) grammar! These languages are hard to tokenize because what a piece of text means depends on where it occurs. White space sometimes has a special meaning (in definition lists, for example). So no, it's not at all easy, IMO. If this is just an exercise to get acquainted with ANTLR (or parser generators in general), I highly recommend choosing something else to parse.

Robin wrote:

Could someone point to my errors and maybe give me a hint on how to match regular text?

You must first realize that ANTLR creates both a lexer (tokenizer) and a parser. Lexer rules start with an uppercase letter and parser rules start with a lowercase letter. A parser can only operate on tokens (the objects that are created by lexer rules). To keep things orderly, you should not use token literals inside parser rules (see rule q in the grammar below). Also, the ~ (negation) meta char has a different meaning depending on where it's used (in a parser rule or in a lexer rule).

Take the following grammar:
```
p : T;
q : ~'z';

T : ~'x';
U : 'y';
```
ANTLR will first "move" the 'z' literal to a lexer rule like this:
```
p : T;
q : ~RANDOM_NAME;

T : ~'x';
U : 'y';
RANDOM_NAME : 'z';
```
(the generated rule is not actually called RANDOM_NAME, but that doesn't matter). Now, the parser rule q does not mean "match any character other than 'z'"! A negation inside a parser rule negates a token (or lexer rule). So ~RANDOM_NAME matches any single token other than RANDOM_NAME, which here means a token produced by either lexer rule T or lexer rule U.

Inside lexer rules, ~ negates (single!) characters. So the lexer rule T will match any character in the range \u0000..\uFFFF except 'x'. Note that ~'ab' is invalid inside a lexer rule: you can only negate sets of single characters, such as ~('a' | 'b').

So, all these ~'???' inside your parser rules are wrong (wrong as in: they don't behave as you expect them to).
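To make the difference concrete, here is a toy model in Python. It is not ANTLR and does not use any ANTLR API; the names `TOKEN_TYPES`, `parser_negation` and `lexer_negation_matches` are made up for this illustration. It only mirrors the two meanings of ~ for the small grammar above: in a parser rule the negation ranges over token types, in a lexer rule it ranges over single characters.

```python
# Toy model (not ANTLR itself) of what '~' means in each kind of rule.

# Token types of the example grammar, after 'z' was moved into its own lexer rule.
TOKEN_TYPES = {"T", "U", "RANDOM_NAME"}


def parser_negation(excluded):
    """In a parser rule, ~X means: any single token whose type is not X."""
    return TOKEN_TYPES - {excluded}


def lexer_negation_matches(excluded_chars, ch):
    """In a lexer rule, ~'x' means: any single character outside the excluded set."""
    return len(ch) == 1 and ch not in excluded_chars


print(sorted(parser_negation("RANDOM_NAME")))  # ['T', 'U']
print(lexer_negation_matches({"x"}, "a"))      # True
print(lexer_negation_matches({"x"}, "x"))      # False
```

Note that the parser-side result is a set of token types, never a set of characters, which is why ~'z' inside a parser rule cannot be used to match "regular text".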

Robin wrote:

Is there a way to set priority on the grammar rules? Maybe this could be a lead.

Yes, the order is top to bottom in both lexer- and parser rules (where the top has the highest priority). Let's say parse is the entry point of your grammar:
```
parse
  :  p
  |  q
  ;
```
then p is tried first, and only if that fails is q tried.
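The effect of that ordering can be sketched in a few lines of Python. This is a simplified model, not how ANTLR is implemented: a generated parser decides with lookahead rather than by literally trying each alternative, and the helper `first_matching_alternative` is a hypothetical name. The two alternatives reuse p and q from the earlier example, where both can accept a T token.

```python
def first_matching_alternative(alternatives, token):
    """Return the name of the first alternative (in written order) that accepts the token."""
    for name, accepts in alternatives:
        if accepts(token):
            return name
    return None  # no alternative matches


# Alternatives of 'parse', listed top to bottom as in the grammar.
ALTERNATIVES = [
    ("p", lambda tok: tok == "T"),            # p : T;
    ("q", lambda tok: tok != "RANDOM_NAME"),  # q : ~RANDOM_NAME;  (also accepts T)
]

print(first_matching_alternative(ALTERNATIVES, "T"))            # p
print(first_matching_alternative(ALTERNATIVES, "U"))            # q
print(first_matching_alternative(ALTERNATIVES, "RANDOM_NAME"))  # None
```

A T token is claimed by p even though q could also accept it, simply because p is written first.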

As for lexer rules, rules for keywords, for example, are placed (and therefore matched) before a rule that could possibly match those same keywords:
```
// first keywords:
WHILE : 'while';
IF    : 'if';
ELSE  : 'else';

// and only then, the identifier rule: 
ID    : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
```
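To see why the order matters, here is a small hand-written tokenizer in Python that imitates the keyword rules above. It is only a sketch under one assumption: the longest match wins, and a tie goes to the rule listed first. The names `RULES`, `next_token` and `tokenize` are invented for the example, and the regular expressions are a transcription of the lexer rules, not something ANTLR generates.

```python
import re

# Rules in priority order: keywords first, the identifier rule last.
RULES = [
    ("WHILE", re.compile(r"while")),
    ("IF", re.compile(r"if")),
    ("ELSE", re.compile(r"else")),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
]


def next_token(text, pos):
    """Return (type, lexeme) for the longest match at pos; ties go to the earliest rule."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so an earlier rule keeps the token when lengths are equal.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


def tokenize(text):
    """Split text into (type, lexeme) pairs, skipping white space."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens


print(tokenize("while iffy else"))
# [('WHILE', 'while'), ('ID', 'iffy'), ('ELSE', 'else')]
```

If ID were moved to the top of `RULES`, it would win every tie and "while" would come out as an ID, which is exactly the situation the keyword-first ordering avoids.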