Flex RegEx not getting matched?

Question

I've been working with Flex/Bison for about 6 hours now, and here is the first problem I don't seem to be able to solve:

I have the following file...

 state state1: {
     1-3: 255
     4: 255
 }

...which I pass to my flex/bison program using cat and |. The flex file contains this line:

\bstate\b  { return STATE; }

and further down this one:

.*         { fprintf(stderr, "Lexer error on line %d: \"%s\"\n", linenum, yytext); exit(-1); }

One should think that \bstate\b should get matched in the file, but it doesn't. Instead I get the following output:

"exer error on line 1: "state state1: {

This is strange in several ways. Firstly, the L in Lexer seems to have been replaced by a ", but more importantly, state didn't get matched. Why?

Of course the \bstate\b is before the .*, and they are in the right section.

Thanks for your help, Jan

Answer 1:

(F)Lex does not search the input for a match. It tries all the patterns at the current input position, and selects the one which matches the most text, or the earliest one if more than one matches the same amount of text. The next lex match will start where the previous one ended.
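The selection rule can be sketched in plain C. This only illustrates the rule; it is not how flex works internally (flex compiles all patterns into a single automaton). The names match_literal, match_rest_of_line and pick_rule are made up for this sketch. Rule 0 stands for a plain literal state and rule 1 for .*, in the same order as in the question.

```c
#include <stddef.h>
#include <string.h>

/* Length matched by a literal pattern at the start of s, or 0 if it does not match. */
static size_t match_literal(const char *s, const char *lit) {
    size_t n = strlen(lit);
    return strncmp(s, lit, n) == 0 ? n : 0;
}

/* Length matched by .* at the start of s: everything up to, but not including, '\n'. */
static size_t match_rest_of_line(const char *s) {
    return strcspn(s, "\n");
}

/* Tries both rules at the current position and returns the index of the winner.
   The match length is stored in *len; the next match would start at s + *len. */
static int pick_rule(const char *s, size_t *len) {
    size_t lens[2];
    int best = 0;
    lens[0] = match_literal(s, "state");   /* rule 0: state */
    lens[1] = match_rest_of_line(s);       /* rule 1: .*    */
    for (int i = 1; i < 2; i++) {
        /* Only a strictly longer match wins; on a tie the earlier rule is kept. */
        if (lens[i] > lens[best]) best = i;
    }
    *len = lens[best];
    return best;
}
```

Called on the first line of the question's input, pick_rule chooses rule 1 with a length of 15, because .* covers the whole line. On a line containing only state, both rules match five characters and the earlier rule 0 is chosen.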

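.* matches the rest of the line. Even if \bstate\b matched the word state, that match would be only five characters long, so .* would win. But \bstate\b does not actually match at all, because this is lex, not <insert your favourite regex syntax here>, and \b means backspace, like it would in a C program. The pattern therefore describes seven literal characters (backspace, s, t, a, t, e, backspace), which never occur in your input.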
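Since lex follows the C meaning of \b, a small C fragment shows what the pattern text really contains. The names pattern_text and line_contains_pattern_text are invented for this illustration.

```c
#include <string.h>

/* In C, and in (f)lex patterns, \b is the backspace character (ASCII 8),
   not a word-boundary assertion. */
static const char pattern_text[] = "\bstate\b";

/* Returns 1 if the literal text above occurs anywhere in the line, 0 otherwise. */
static int line_contains_pattern_text(const char *line) {
    return strstr(line, pattern_text) != NULL;
}
```

The string is seven characters long, and an ordinary line such as the one in the question does not contain it, because that line has no backspace characters.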

The L is probably overwritten with a quote because your input file was created on Windows and has \r\n at the end of each line. .* matches up to and including the \r, which is a carriage return. So when you print "%s" followed by \n, the last character of the string substituted for %s is a carriage return, which moves the cursor back to the first column of the current line, where the L had been printed. The closing " is then printed over the L, and finally the newline character starts a new line.
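The effect can be reproduced without a terminal by formatting the message into a buffer and looking at the bytes. This is a sketch under the assumption that the matched text ends in \r; format_error and strip_trailing_cr are hypothetical helpers, and stripping the \r is just one possible workaround, not something the original program does.

```c
#include <stdio.h>
#include <string.h>

/* Formats the message the same way as the fprintf call in the question,
   but into a buffer so the resulting bytes can be inspected. */
static int format_error(char *out, size_t cap, int linenum, const char *text) {
    return snprintf(out, cap, "Lexer error on line %d: \"%s\"\n", linenum, text);
}

/* Hypothetical helper: removes one trailing '\r' in place, if there is one. */
static void strip_trailing_cr(char *text) {
    size_t n = strlen(text);
    if (n > 0 && text[n - 1] == '\r') text[n - 1] = '\0';
}
```

With the text state state1: { followed by \r, the formatted message contains the sequence {, \r, ", \n, so the closing quote is the first thing printed after the cursor returns to column one. After strip_trailing_cr the closing quote directly follows the {.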

There is no Lex equivalent to the word-boundary assertion \b but that's very rarely a problem. Lexical scanners for practically all programming languages have to cope with the issue that reserved words will also match the pattern for identifiers; however, the combination of the longest-match and first-match rules makes it easy to do this. Put simply, always put reserved word patterns first. For example:

do              { return DO; }
double          { return DOUBLE; }
if              { return IF; }
/* ... */
[a-z][a-z0-9]*  { return ID; }

The order in which you put do and double doesn't matter in the above example, because double is longer, but I always feel like you should put reserved words in alphabetical order for tidiness. But it is important that the ID pattern go last, because it also matches all of the reserved words.

Now consider what happens when lexing an identifier that starts with a reserved word, like dog. In this case, the DO pattern and the ID pattern will both match, but the ID match is longer so it wins, despite being later.
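The interplay of the two rules for reserved words and identifiers can be simulated in C. This is a hand-written sketch of the decision only, not generated scanner code; the enum values mirror the token names used in the rules, while next_token, match_id and NO_MATCH are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

enum token { DO, DOUBLE, IF, ID, NO_MATCH };

/* Reserved words come first, in the same order as the rules. */
static const char *const keywords[] = { "do", "double", "if" };
static const enum token keyword_tokens[] = { DO, DOUBLE, IF };

/* Length matched by [a-z][a-z0-9]* at the start of s, or 0 if it does not match. */
static size_t match_id(const char *s) {
    size_t n;
    if (s[0] < 'a' || s[0] > 'z') return 0;
    n = 1;
    while ((s[n] >= 'a' && s[n] <= 'z') || (s[n] >= '0' && s[n] <= '9')) n++;
    return n;
}

/* Picks the longest match at s; on equal length the earlier rule is kept. */
static enum token next_token(const char *s, size_t *len) {
    enum token best = NO_MATCH;
    size_t best_len = 0;
    size_t id_len;
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        size_t n = strlen(keywords[i]);
        if (strncmp(s, keywords[i], n) == 0 && n > best_len) {
            best = keyword_tokens[i];
            best_len = n;
        }
    }
    /* The ID rule is last, so it only wins when it is strictly longer. */
    id_len = match_id(s);
    if (id_len > best_len) {
        best = ID;
        best_len = id_len;
    }
    *len = best_len;
    return best;
}
```

For the input do the keyword and the identifier both match two characters, so the earlier DO rule is kept. For dog the identifier match is three characters long and ID wins, even though its rule comes last.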

Source: https://stackoverflow.com/questions/12827373/flex-regex-not-getting-matched

Tags

regex

bison

flex-lexer