Differentiating division from regex when lexing gawk code

Question

I am writing a flex lexer for gawk scripts, and I am having trouble distinguishing between the different uses of the forward slash (/) character.

A single / is clearly the division operator, but two slashes could either delimit a regular expression or be two separate divisions. Right now, it parses

int((r-1)/3)*3+int((c-1)/3)+1

as having the regular expression

/3)*3+int((c-1)/

instead of the intended division operations. How do I get flex to recognize it as a mathematical expression?

Right now, this is my flex regular expression to recognize regular expressions in gawk:

EXT_REG_EXP "\/"("\\\/"|[^\/\n])*"\/"

and the division operator should be caught by my list of operators:

OPERATOR "+"|"-"|"*"|"/"|"%"|"^"|"!"|">"|"<"|"|"|"?"|":"|"~"|"$"|"="

But since flex always prefers the longest match, I guess it treats everything between the two division operators as a regular expression.

Answer 1:

I don't think it's possible to define a simple token expression that unambiguously identifies regular expressions. The POSIX spec for Awk notes the ambiguity thusly:

In some contexts, a slash ( '/' ) that is used to surround an ERE could also be the division operator. This shall be resolved in such a way that wherever the division operator could appear, a slash is assumed to be the division operator. (There is no unary division operator.)

And later:

There is a lexical ambiguity between the token ERE and the tokens '/' and DIV_ASSIGN. When an input sequence begins with a slash character in any syntactic context where the token '/' or DIV_ASSIGN could appear as the next token in a valid program, the longer of those two tokens that can be recognized shall be recognized. In any other syntactic context where the token ERE could appear as the next token in a valid program, the token ERE shall be recognized.

("ERE" stands for "extended regular expression.") From this, I think you can safely conclude that a tokenizer for Awk has to be aware of the syntactic context, and hence there is no possible regular expression that could successfully identify regular expression tokens.
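One common way to approximate this context sensitivity in a hand-written tokenizer is to look at the previous token: if it can end an operand (a number, a name, `)` or `]`), a following slash is division; otherwise it starts an ERE. The C++ sketch below illustrates that idea only. The `Kind`, `Token`, `ends_operand` and `tokenize` names are hypothetical, and the heuristic is cruder than the POSIX rule: it does not handle keywords such as `print`, the `/=` operator, or a slash inside a bracket expression.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; not taken from any real awk lexer.
enum class Kind { Number, Name, Op, Regex };

struct Token {
    Kind kind;
    std::string text;
};

// True if the previous token can end an operand, in which case a
// following '/' is treated as the division operator.
static bool ends_operand(const std::vector<Token>& toks) {
    if (toks.empty()) return false;
    const Token& t = toks.back();
    if (t.kind == Kind::Number || t.kind == Kind::Name) return true;
    return t.kind == Kind::Op && (t.text == ")" || t.text == "]");
}

std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> toks;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        std::size_t start = i;
        if (std::isdigit(c)) {
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            toks.push_back({Kind::Number, src.substr(start, i - start)});
        } else if (std::isalpha(c) || c == '_') {
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
            toks.push_back({Kind::Name, src.substr(start, i - start)});
        } else if (c == '/' && !ends_operand(toks)) {
            // Operand position: scan an ERE up to the next unescaped '/'.
            ++i;
            while (i < src.size() && src[i] != '/' && src[i] != '\n') {
                if (src[i] == '\\' && i + 1 < src.size()) ++i;  // skip the escaped character
                ++i;
            }
            if (i < src.size() && src[i] == '/') ++i;  // consume the closing slash
            toks.push_back({Kind::Regex, src.substr(start, i - start)});
        } else {
            // Everything else, including '/' after an operand, is a one-character operator.
            ++i;
            toks.push_back({Kind::Op, std::string(1, static_cast<char>(c))});
        }
    }
    return toks;
}
```

With this sketch, every slash in `int((r-1)/3)*3+int((c-1)/3)+1` follows a `)` and is therefore classified as an operator, while the slash after `~` in `$0 ~ /a\/b/` starts a regex token.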

It's also worth looking at how Awk itself (or at least one of the implementations) is defined to parse regexes. In the original Awk (sometimes called the One True Awk), identifying regular expressions is the job of the parser, which explicitly sets the lexer into "regex mode" when it has figured out that it should expect to read a regex:

reg_expr:
      '/' {startreg();} REGEXPR '/'     { $$ = $3; }
    ;

(startreg() is a function defined in lex.c.) The reg_expr rule itself is only ever matched in contexts where a division operator would be invalid.
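The parser-driven approach can be sketched in C++ as follows. This is an illustration of the idea only, not the actual code from lex.c: `ModeLexer`, `start_regex` and `parse_reg_expr` are hypothetical names, and the lexer is reduced to single-character tokens so that the mode switch is the only interesting part.

```cpp
#include <cstddef>
#include <string>

// Hypothetical miniature lexer; the names here are illustrative only.
struct ModeLexer {
    std::string src;
    std::size_t pos = 0;
    bool regex_mode = false;

    // Called by the parser when the grammar says a regex body comes next.
    void start_regex() { regex_mode = true; }

    // Returns the text of the next token; an empty string means end of
    // input (or an empty regex body when called in regex mode).
    std::string next() {
        if (regex_mode) {
            regex_mode = false;
            std::string body;
            while (pos < src.size() && src[pos] != '/') {
                // Keep a backslash together with the character it escapes.
                if (src[pos] == '\\' && pos + 1 < src.size()) body += src[pos++];
                body += src[pos++];
            }
            return body;  // the closing '/' is left for the parser to consume
        }
        if (pos >= src.size()) return "";
        return std::string(1, src[pos++]);
    }
};

// Mirrors the shape of the reg_expr rule: '/' start_regex() REGEXPR '/'.
// The caller is assumed to invoke this only where an operand is expected,
// so a '/' seen here cannot be a division operator.
bool parse_reg_expr(ModeLexer& lx, std::string& body) {
    if (lx.next() != "/") return false;
    lx.start_regex();
    body = lx.next();
    return lx.next() == "/";
}
```

The point of the design is that the lexer never has to guess: in its default mode a slash is always a plain operator token, and only the parser, which knows it is at the start of an operand, asks for the regex body.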

Sorry to disappoint, but I hope this helps nonetheless.

Source: https://stackoverflow.com/questions/12665213/differentiating-division-from-regex-when-lexing-gawk-code

Tags

c++

regex

awk

lex