How to parse new line in Scala grammar with flex/bison?

问题

I want to parse Scala grammar with flex and bison. But I don't know how to parse the newline token in Scala grammar.

If I parse newline as a token T_NL, Here's the Toy.l for example:

...
[a-zA-Z_][a-zA-Z0-9_]*  { yylval->literal = strdup(yy_text); return T_ID; }
\n                      { yylval->token = T_LN; return T_LN; }
[ \t\v\f\r]             { /* skip whitespaces */ }
...

And here's the Toy.y for example:

function_def: 'def' T_ID '(' argument_list ')' return_expression '=' expression T_NL
            ;

argument_list: argument
             | argument ',' argument_list
             ;

expression: ...
          ;

return_expression: ...
                 ;

You could see that I have to skip T_NL in all other statements and definitions in Toy.y, which is really boring.

Please educate me with source code example!

回答1:

This is a clear case where bison push-parsers are useful. The basic idea is that the decision to send an NL token (or tokens) can only be made when the following token has been identified (and, in one corner case, the second following token).

The advantage of push parsers is that they let us implement strategies like this, where there is not necessarily a one-to-one relationship between input lexemes and tokens sent to the parser. I'm not going to deal with all the particularities of setting up a push parser (though it's not difficult); you should refer to the bison manual for details. [Note 1]

First, it's important to read the Scala language description with care. Newline processing is described in section 2.13:

A newline in a Scala source text is treated as the special token “nl” if the three following criteria are satisfied:

The token immediately preceding the newline can terminate a statement.

The token immediately following the newline can begin a statement.

The token appears in a region where newlines are enabled.

Rules 1 and 2 are simple lookup tables, which are precisely defined in the following two paragraphs. There is just one minor exception to rule 2 has a minor exception, described below:

A case token can begin a statement only if followed by a class or object token.

One hackish possibility to deal with that exception would be to add case[[:space:]]+class and case[[:space:]]+object as lexemes, on the assumption that no-one will put a comment between case and class. (Or you could use a more sophisticated pattern, which allows comments as well as whitespace.) If one of these lexemes is recognised, it could either be sent to the parser as a single (fused) token, or it could be sent as two tokens using two invocations of SEND in the lexer action. (Personally, I'd go with the fused token, since once the pair of tokens has been recognised, there is no advantage to splitting them up; afaik, there's no valid program in which case class can be parsed as anything other than case class. But I could be wrong.)

To apply rules one and two, then, we need two lookup tables indexed by token number: token_can_end_stmt and token_cannot_start_stmt. (The second one has its meaning reversed because most tokens can start statements; doing it this way simplifies initialisation.)

        /* YYNTOKENS is defined by bison if you specify %token-tables */
        static bool token_can_end_stmt[YYNTOKENS] = {
          [T_THIS] = true, [T_NULL] = true, /* ... */
        };
        static bool token_cannot_start_stmt[YYNTOKENS] = {
          [T_CATCH] = true, [T_ELSE] = true, /* ... */
        };

We're going to need a little bit of persistent state, but fortunately when we're using a push parser, the scanner does not need to return to its caller every time it recognises a token, so we can keep the persistent state as local variables in the scan loop. (That's another advantage of the push-parser architecture.)

From the above description, we can see that what we're going to need to maintain in the scanner state are:

some indication that a newline has been encountered. This needs to be a count, not a boolean, because we might need to send two newlines:

if two tokens are separated by at least one completely blank line (i.e a line which contains no printable characters), then two nl tokens are inserted. A simple way to handle this is to simply compare the current line number with the line number at the previous token. If they are the same, then there was no newline. If they differ by only one, then there was no blank line. If they differ by more than one, then there was either a blank line or a comment (or both). (It seems odd to me that a comment would not trigger the blank line rule, so I'm assuming that it does. But I could be wrong, which would require some adjustment to this scanner.) [Note 2]
the previous token (for rule 1). It's only necessary to record the token number, which is a simple small integer.
some way of telling whether we're in a "region where newlines are enabled" (for rule 3). I'm pretty sure that this will require assistance from the parser, so I've written it that way here.

By centralising the decision about sending a newline into a single function, we can avoid a lot of code duplication. My typical push-parser architecture uses a SEND macro anyway, to deal with the boilerplate of saving the semantic value, calling the parser, and checking for errors; it's easy to add the newline logic there:

// yylloc handling mostly omitted for simplicity
#define SEND_VALUE(token, tag, value) do {                             \
    yylval.tag = value;                                                \
    SEND(token);                                                       \
} while(0);

#define SEND(token) do {                                               \
    int status = YYPUSH_MORE;                                          \
    if (yylineno != prev_lineno)                                       \
            && token_can_end_stmt[prev_token]                          \
            && !token_cannot_start_stmt[token]                         \
            && in_new_line_region) {                                   \
      status = yypush_parse(ps, T_NL, NULL, &yylloc, &nl_enabled);     \
      if (status == YYPUSH_MORE                                        \
          && yylineno - prev_lineno > 1)                               \
        status = yypush_parse(ps, T_NL, NULL, &yylloc, &nl_enabled);   \
    }                                                                  \
    nl_encountered = 0;                                                \
    if (status == YYPUSH_MORE)                                         \
      status = yypush_parse(ps, token, &yylval, &yylloc, &nl_enabled); \
    if (status != YYPUSH_MORE) return status;                          \
    prev_token = token;                                                \
    prev_lineno = yylineno;                                            \
while (0);

Specifying the local state in the scanner is extremely simple; just place the declarations and initialisations at the top of your scanner rules, indented. Any indented code prior to the first rule is inserted directly into yylex, almost at the top of the function (so it is executed once per call, not once per lexeme):

%%
    int nl_encountered = 0;
    int prev_token = 0;
    int prev_lineno = 1;
    bool nl_enabled = true;
    YYSTYPE yylval;
    YYLTYPE yylloc = {0};

Now, the individual rules are pretty simple (except for case). For example, we might have rules like:

"while"                     { SEND(T_WHILE); }
[[:lower:]][[:alnum:]_]*    { SEND_VALUE(T_VARID, str, strdup(yytext)); }

That still leaves the question of how to determine if we are in a region where newlines are enabled.

Most of the rules could be handled in the lexer by just keeping a stack of different kinds of open parentheses, and checking the top of the stack: If the parenthesis on the top of the stack is a {, then newlines are enabled; otherwise, they are not. So we could use rules like:

[{([]   { paren_stack_push(yytext[0]); SEND(yytext[0]); }
[])}]   { paren_stack_pop(); SEND(yytext[0]); }

However, that doesn't deal with the requirement that newlines be disabled between a case and its corresponding =>. I don't think it's possible to handle that as another type of parenthesis, because there are lots of uses of => which do not correspond with a case and I believe some of them can come between a case and it's corresponding =>.

So a better approach would be to put this logic into the parser, using lexical feedback to communicate the state of the newline-region stack, which is what is assumed in the calls to yypush_parse above. Specifically, they share one boolean variable between the scanner and the parser (by passing a pointer to the parser). [Note 3] The parser then maintains the value of this variable in MRAs in each rule which matches a region of potentially different newlinedness, using the parse stack itself as a stack. Here's a small excerpt of a (theoretical) parser:

%define api.pure full
%define api.push-pull push
%locations
%parse-param { bool* nl_enabled; }
/* More prologue */
%%
// ...
/* Some example productions which modify nl_enabled: */

/* The actions always need to be before the token, because they need to take
 * effect before the next lookahead token is requested.
 * Note how the first MRA's semantic value is used to keep the old value
 * of the variable, so that it can be restored in the second MRA.
 */
TypeArgs          :  <boolean>{ $$ = *nl_enabled; *nl_enabled = false; }
                     '[' Types
                     { *nl_enabled = $1; }  ']'

CaseClause        :  <boolean>{ $$ = *nl_enabled; *nl_enabled = false; }
                     "case" Pattern opt_Guard 
                     { *nl_enabled = $1; }   "=>" Block

FunDef             : FunSig opt_nl
                     <boolean>{ $$ = *nl_enabled; *nl_enabled = true; }
                     '{' Block  
                     { *nl_enabled = $3; }   '}'

Notes:

Push parsers have many other advantages; IMHO they are they are the solution of choice. In particular, using push parsers avoids the circular header dependency which plagues attempts to build pure parser/scanner combinations.
There is still the question of multiline comments with preceding and trailing text:
```
return   /* Is this comment
            a newline?       */ 42
```
I'm not going to try to answer that question.)
It would be possible to keep this flag in the YYLTYPE structure, since only one instance of yylloc is ever used in this example. That might be a reasonable optimisation, since it cuts down on the number of parameters sent to yypush_parse. But it seemed a bit fragile, so I opted for a more general solution here.

来源：https://stackoverflow.com/questions/62987461/how-to-parse-new-line-in-scala-grammar-with-flex-bison

标签

scala

bison

flex-lexer