Explanation and solution for JavaCC's warning “Regular expression choice : FOO can never be matched as : BAR”?

 ̄綄美尐妖づ 提交于 2020-01-03 15:50:09

问题


I am teaching myself to use JavaCC in a hobby project, and have a simple grammar to write a parser for. Part of the parser includes the following:

TOKEN : { < DIGIT : (["0"-"9"]) > }
TOKEN : { < INTEGER : (<DIGIT>)+ > }
TOKEN : { < INTEGER_PAIR : (<INTEGER>){2} > }
TOKEN : { < FLOAT : (<NEGATE>)? <INTEGER> | (<NEGATE>)? <INTEGER>  "." <INTEGER>  | (<NEGATE>)? <INTEGER> "." | (<NEGATE>)? "." <INTEGER> > } 
TOKEN : { < FLOAT_PAIR : (<FLOAT>){2} > }
TOKEN : { < NUMBER_PAIR : <FLOAT_PAIR> | <INTEGER_PAIR> > }
TOKEN : { < NEGATE : "-" > }

When compiling with JavaCC I get the output:

Warning: Regular Expression choice : FLOAT_PAIR can never be matched as : NUMBER_PAIR

Warning: Regular Expression choice : INTEGER_PAIR can never be matched as : NUMBER_PAIR

I'm sure this is a simple concept but I don't understand the warning, being a novice in both parser generation and regular expressions.

What does this warning mean (in as-novice-as-you-can-get terms)?


回答1:


I don't know JavaCC, but I am a compiler engineer.

The FLOAT_PAIR rule is ambiguous. Consider the following text:

0.0

This could be FLOAT 0 followed by FLOAT .0; or it could be FLOAT 0. followed by FLOAT 0; both resulting in FLOAT_PAIR. Or it could be a single FLOAT 0.0.

More importantly, though, you are using lexical analysis with composition in a way that is never likely to work. Consider this number:

12345

This could be parsed as INTEGER 12, INTEGER 345 resulting in an INTEGER_PAIR. Or it could be parsed as INTEGER 123, INTEGER 45, another INTEGER_PAIR. Or it could be INTEGER 12345, another token. The problem exists because you are not requiring white space between the lexical elements of the INTEGER_PAIR (or FLOAT_PAIR).

You should almost never try to handle pairs like this in the lexer. Instead, you should handle plain numbers (INTEGER and FLOAT) as tokens, and handle things like negation and pairing in the parser, where whitespace has been dealt with and stripped.

(For example, how are you going to process "----42"? This is a valid expression in most programming languages, which will correctly calculate multiple negations, but would not be handled by your lexer.)

Also, be aware that single-digit integers in your lexer will not be matched as INTEGER, they will come out as DIGIT. I don't know the correct syntax for JavaCC to fix that for you, though. What you want is to define DIGIT not as a token, but simply something you can use in the definitions of other tokens; alternatively, embed the definition of DIGIT ([0-9]) directly wherever you are using DIGIT in your rules.




回答2:


I haven't used JavaCC, but it is possible that NUMBER_PAIR is ambiguous.

I think the problem comes down to the fact that the same exact thing can be matched as both FLOAT_PAIR and INTEGER_PAIR since FLOAT can match an INTEGER.

But this is just a guess having never seen the JavaCC syntax :)




回答3:


It probably means that for every FLOAT_PAIR you'll just get a FLOAT_PAIR token, never a NUMBER_PAIR token. The FLOAT_PAIR rule already matches all the input and JavaCC will not try to find further matching rules. That would be my interpretation, but I don't know JavaCC, so take it with a grain of salt.

Maybe you can specify somehow that NUMBER_PAIR is the main production and that you don't want to get any other tokens as results.




回答4:


Thanks to Barry Kelly's answer, the solution I've come up with is:

    SKIP : { < #TO_SKIP : " " | "\t" > }
    TOKEN : { < #DIGIT : (["0"-"9"]) > }
    TOKEN : { < #DIGITS : (<DIGIT>)+ > }
    TOKEN : { < INTEGER : <DIGITS> > }
    TOKEN : { < INTEGER_PAIR : (<INTEGER>) (<TO_SKIP>)+ (<INTEGER>) > }
    TOKEN : { < FLOAT : (<NEGATE>)?<DIGITS>"."<DIGITS> | (<NEGATE>)?"."<DIGITS> > } 
    TOKEN : { < FLOAT_PAIR : (<FLOAT>) (<TO_SKIP>)+ (<FLOAT>) > }
    TOKEN : { < #NUMBER : <FLOAT> | <INTEGER> > }
    TOKEN : { < NUMBER_PAIR : (<NUMBER>) (<TO_SKIP>)+ (<NUMBER>) >}
    TOKEN : { < NEGATE : "-" > }

I had completely forgot to include the space which is used to separate the two tokens, I've also used the '#' symbol which stops the tokens being matched, and is just used in the definition of other tokens. The above is compiled by JavaCC without warning or error.

However, as noted by Barry, there are reasons against doing this.



来源:https://stackoverflow.com/questions/791591/explanation-and-solution-for-javaccs-warning-regular-expression-choice-foo-c

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!