(F) Lex, how do I match negation?

Submitted by 谁都会走 on 2019-12-13 02:06:29

Question


Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:

~('\'|'"'|'$'|NEWLINE)

This means "match anything that is not one of the rules inside the parentheses". Now, I know that in flex I can negate character classes (e.g. [^ab]), but some of the rules I want to negate could be more complicated than a single character, so I don't think I could use character classes for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure how to do that in flex.


Answer 1:


(TL;DR: Skip down to the bottom for a practical answer.)

The inverse (complement) of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.

The """ case, at least, is not too difficult.

First, let's be clear about what we are trying to match.

Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".

So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:

"""This string might contain " or ""."""

If we start after the initial """ and look for the longest string which doesn't contain """, we will find:

This string might contain " or "".""

whereas what we wanted was:

This string might contain " or "".

So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)

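It's (relatively) easy to produce a state diagram for that: state 0 is the start state and the only accepting state, a " moves from state 0 to state 1 and from state 1 to state 2, a third consecutive " is rejected, and any other character leads back to state 0.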

(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
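The state diagram can also be written out as a small transition function. The sketch below is a reconstruction from the description in the text, assuming that state n means "the input so far ends with n quote characters" and that state 0 is both the start state and the only accepting state; `DEAD`, `step` and `accepts` are illustrative names, not part of any flex API.

```python
# Hand-built DFA for "contains no triple quote and does not end with a quote".
# Assumption: states 0, 1, 2 count the trailing '"' characters seen so far.
DEAD = -1  # hypothetical reject state, entered on a third consecutive quote


def step(state: int, ch: str) -> int:
    """Return the next state after reading one character."""
    if state == DEAD:
        return DEAD          # once rejected, always rejected
    if ch != '"':
        return 0             # any non-quote character resets the count
    return state + 1 if state < 2 else DEAD


def accepts(text: str) -> bool:
    """True if the DFA ends in state 0, the only accepting state."""
    state = 0
    for ch in text:
        state = step(state, ch)
    return state == 0
```

Changing the last line to `return state != DEAD` would make states 1 and 2 accepting as well, which gives the automaton for the plain "does not contain `"""`" language.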

Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:

([^"]|\"([^"]|\"[^"]))*
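As a quick check, the pattern can be tried on the example input from earlier. This sketch uses Python's `re` module rather than flex, on the assumption that the two agree for this particular pattern (each repetition is forced by the next character, so greedy matching and longest-match give the same result); `BODY`, `source`, `body` and `rest` are illustrative names.

```python
import re

# The pattern from the text; inside a Python raw string the quote needs no backslash.
BODY = re.compile(r'([^"]|"([^"]|"[^"]))*')

source = '"""This string might contain " or ""."""'

# Start matching just after the opening triple quote (offset 3).
m = BODY.match(source, 3)
body = m.group(0)          # what the pattern consumed
rest = source[m.end():]    # what is left for the terminator
```

The match stops immediately before the closing `"""`, because neither `"[^"]` nor `""[^"]` can be completed there.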

This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:

([^E]|E([^N]|N[^D]))*   <--- DON'T USE THIS

but that regular expression will match the string

ENENDstuff which shouldn't have been matched
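The failure is easy to reproduce. The following sketch again uses Python's `re` as a stand-in for flex (an assumption that holds for this simple pattern); `NAIVE`, `text` and `matched` are illustrative names.

```python
import re

# The naive adaptation of the triple-quote pattern. Do not use this.
NAIVE = re.compile(r'([^E]|E([^N]|N[^D]))*')

text = "ENENDstuff which shouldn't have been matched"

# fullmatch succeeds even though the text contains END.
matched = NAIVE.fullmatch(text) is not None
```

The pattern consumes `ENE` as one unit through `EN[^D]`, and in doing so forgets that the second `E` could itself begin `END`; the following `N` and `D` are then accepted by `[^E]`.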

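The real state diagram we're looking for has three states: state 0 (the start state and the only accepting state), state 1 after an E, and state 2 after EN. E always leads to state 1, N leads from state 1 to state 2, D in state 2 is rejected, and every other transition returns to state 0.

One way of writing that as a regular expression is: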

([^E]|E(E|NE)*([^EN]|N[^ED]))*

Again, I produced that by tracing all the ways to end up in state 0:

[^E] stays in state 0
E    in state 1:
     (E|NE)*: stay in state 1
     [^EN]:   back to state 0
     N[^ED]:  back to state 0 via state 2
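Because such patterns are error-prone, it can help to compare the expression against the state machine by brute force. The sketch below does that in Python for every string of up to six characters over the small alphabet `E`, `N`, `D`, `x`. The pattern is applied with a trailing `*` so that the loop through state 0 can repeat, as in the `"""` pattern. `FIXED`, `dfa_accepts` and `mismatches` are illustrative names, and the DFA is written from the trace above.

```python
import re
from itertools import product

# Each trip from state 0 back to state 0, repeated any number of times.
FIXED = re.compile(r'([^E]|E(E|NE)*([^EN]|N[^ED]))*')


def dfa_accepts(text: str) -> bool:
    """State 0: no progress, state 1: seen E, state 2: seen EN."""
    state = 0
    for ch in text:
        if ch == 'E':
            state = 1                 # E always (re)starts a possible END
        elif ch == 'N' and state == 1:
            state = 2
        elif ch == 'D' and state == 2:
            return False              # END found: reject
        else:
            state = 0
    return state == 0                 # only state 0 is accepting


# Collect every short string on which the regex and the DFA disagree.
mismatches = [
    ''.join(chars)
    for n in range(7)
    for chars in product('ENDx', repeat=n)
    if (FIXED.fullmatch(''.join(chars)) is not None) != dfa_accepts(''.join(chars))
]
```

An empty `mismatches` list is not a proof of equivalence, only evidence for short inputs, but it catches slips like the naive pattern quickly.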

This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).


A practical and scalable solution

Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize Python triple-quoted strings:

%x TRIPLEQ
start \"\"\"
end   \"\"\"
%%

{start}        { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }

<TRIPLEQ>.|\n  { /* Append the next token to yytext instead of
                  * replacing yytext with the next token
                  */
                 yymore();
                 /* No return yet, flex continues */
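               }
<TRIPLEQ>{end} { /* We've found the end of the string, but
                  * we need to get rid of the terminating """
                  */
                 /* yyleng - 3 body bytes plus a terminating NUL */
                 yylval.str = malloc(yyleng - 2);
                 memcpy(yylval.str, yytext, yyleng - 3);
                 yylval.str[yyleng - 3] = 0;
                 /* Leave the start condition so that normal
                  * scanning resumes after the string
                  */
                 BEGIN( INITIAL );
                 return STRING;
               }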

This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\" instead of .|\n (in flex, a negated character class such as [^"] already matches a newline), because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.

This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions:

start "<![CDATA["
end   "]]>"

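(and possibly the optimized rule inside the start condition, if using the optimization suggested above. The yyleng - 3 arithmetic in the {end} action needs no change, because ]]> is also three characters long.)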
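The control flow of these rules can be mimicked outside flex to see why changing only the two delimiters is enough. The following Python sketch is an emulation of the idea, not generated scanner code: `scan_delimited` and its parameters are hypothetical names, and the loop plays the role of the exclusive start condition.

```python
def scan_delimited(text, pos, start, end):
    """Emulate the start-condition rules.

    Returns (body, next_pos) if a delimited string opens at `pos`,
    or None if it does not open there or is never terminated.
    """
    if not text.startswith(start, pos):
        return None                        # {start} does not match here
    pos += len(start)                      # like BEGIN(TRIPLEQ)
    body_start = pos
    while pos < len(text):
        if text.startswith(end, pos):      # {end} is longer than one char, so it wins
            return text[body_start:pos], pos + len(end)
        pos += 1                           # like the .|\n rule with yymore()
    return None                            # input ended inside the string


quoted = '"""This string might contain " or ""."""'
cdata = '<![CDATA[a ]] > b]]>tail'

triple_result = scan_delimited(quoted, 0, '"""', '"""')
cdata_result = scan_delimited(cdata, 0, '<![CDATA[', ']]>')
```

Nothing in the loop depends on the delimiter's characters, which is the property that makes the start-condition approach scale where hand-written negated expressions do not.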



Source: https://stackoverflow.com/questions/25960801/f-lex-how-do-i-match-negation
