ANTLR behaviour with conflicting tokens

独自空忆成欢 提交于 2019-12-11 00:48:28

问题


How is ANTLR lexer behavior defined in the case of conflicting tokens? Let me explain what I mean by "conflicting" tokens. For example, assume that the following is defined:

INT_STAGE       :   '1'..'6';
INT             :   '0'..'9'+;

There is a conflict here, because after reading a sequence of digits, the lexer would not know whether there is one INT or many INT_STAGE tokens (or different combinations of both). After a test, it looks like that if INT is defined after INT_STAGE, the lexer would prefer to find INT_STAGE, but maybe not INT then? Otherwise, no INT_STAGE would ever be found.

Another example would be:

FOOL: ' fool'
FOO: 'foo'
ID              :   ('a'..'z'|'A'..'Z'|'_'|'%') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'%')*;

I was told that this is the "right" order to recognize all the tokens: while reading "fool" the lexer will find one FOOL token and not FOO ID or something else.


回答1:


The following logic applies:

  1. the lexer matches as much characters as possible
  2. if after applying rule 1, there are 2 or more rules that match the same amount of characters, the rule defined first will "win"

Taking this into account, the input "1", "2", ..., "6" is tokenized as an INT_STAGE: both INT_STAGE and INT match the same amount of characters, but INT_STAGE is defined first.

The input "12" is tokenized as a INT since it matches the most characters.

I was told that this is the "right" order to recognize all the tokens: while reading "fool" the lexer will find one FOOL token and not FOO ID or something else.

That is correct.



来源:https://stackoverflow.com/questions/34592107/antlr-behaviour-with-conflicting-tokens

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!