Question
I'm confused by the Java spec about how this code should be tokenized:
ArrayList<ArrayList<Integer>> i;
The spec says:
The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would.
As I understand it, applying the "longest match" rule would result in the tokens:
- ArrayList
- <
- ArrayList
- <
- Integer
- >>
- i
- ;
which would not parse. But of course this code is parsed just fine.
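The naive longest-match ("maximal munch") tokenization described above can be sketched with a toy lexer. This is purely illustrative and not the real javac lexer; the class and method names are made up, and only the handful of tokens needed for this example are supported. It shows that maximal munch really does emit >> as a single token:

```java
import java.util.ArrayList;
import java.util.List;

// Toy maximal-munch tokenizer (hypothetical, for illustration only):
// at each position it tries the longest operator first, so ">>" wins over ">".
public class MaximalMunchDemo {
    // Operators ordered longest-first so the longest match is tried first.
    private static final String[] OPERATORS = { ">>>=", ">>=", ">>>", ">>", ">=", ">", "<", ";" };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        outer:
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isJavaIdentifierStart(c)) {
                int j = i + 1;
                while (j < input.length() && Character.isJavaIdentifierPart(input.charAt(j))) j++;
                tokens.add(input.substring(i, j));
                i = j;
                continue;
            }
            for (String op : OPERATORS) {
                if (input.startsWith(op, i)) {   // longest match wins
                    tokens.add(op);
                    i += op.length();
                    continue outer;
                }
            }
            throw new IllegalArgumentException("unexpected char: " + c);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [A, <, B, <, C, >>, i, ;] -- 8 tokens, with ">>" as one token.
        System.out.println(tokenize("A<B<C>> i;"));
    }
}
```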
What is the correct specification for this case?
Does it mean that a correct lexer must be context-free? It doesn't seem possible with a regular lexer.
Answer 1:
Based on reading the code linked by @sm4, it looks like the strategy is:
- Tokenize the input normally. So A<B<C>> i; would be tokenized as A, <, B, <, C, >>, i, ; -- 8 tokens, not 9.
- During hierarchical parsing, when a > is needed while parsing generics and the next token starts with > -- that is, it is >>, >>>, >=, >>=, or >>>= -- just knock the leading > off and push the shortened token back onto the token stream. Example: when the parser gets to >>, i, ; while working on the typeArguments rule, it successfully parses typeArguments, and the remaining token stream is now the slightly different >, i, ;, since the first > of >> was pulled off to match typeArguments.
So although tokenization does happen normally, some re-tokenization occurs in the hierarchical parsing phase, if necessary.
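The token-splitting step can be sketched as follows. This is a hypothetical simplification, not the actual javac implementation: the class and method names are invented, and the token stream is modeled as a plain deque of strings. The idea is only that when the parser needs a single > and the next token begins with >, it consumes one > and pushes the remainder back:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the re-tokenization step described above: when the
// parser needs a single '>' to close a type-argument list but the next token
// begins with '>' (e.g. ">>", ">>>", ">=", ">>=", ">>>="), it consumes one
// '>' and pushes the shortened remainder back onto the token stream.
public class AngleBracketSplitter {
    static void expectCloseAngle(Deque<String> tokens) {
        String next = tokens.poll();
        if (next == null || !next.startsWith(">")) {
            throw new IllegalStateException("expected '>', got: " + next);
        }
        if (next.length() > 1) {
            // Knock one '>' off and push the rest back, e.g. ">>" -> ">".
            tokens.push(next.substring(1));
        }
    }

    public static void main(String[] args) {
        Deque<String> stream = new ArrayDeque<>(List.of(">>", "i", ";"));
        expectCloseAngle(stream);   // closes the inner type-argument list
        System.out.println(stream); // [>, i, ;]
        expectCloseAngle(stream);   // closes the outer type-argument list
        System.out.println(stream); // [i, ;]
    }
}
```

With this approach the lexer itself stays regular; only the parser, which is already context-free, performs the split.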
Source: https://stackoverflow.com/questions/16803185/are-s-in-type-parameters-tokenized-using-a-special-rule