How to reduce parser stack or 'unshift' the current token depending on what follows?

Question

Given the following language described as:

formally: (identifier operator identifier+)*
in plain English: zero or more operations written as an identifier (the lvalue), then an operator, then one or more identifiers (the rvalue)

An example of a sequence of operations in that language would be, given the arbitrary operator @:

A @ B C X @ Y

Whitespace is not significant and it may also be written more clearly as:

A @ B C
X @ Y

How would you parse this with a yacc-like LALR parser?

What I tried so far

I know how to parse explicitly delimited operations, say A @ B C ; X @ Y, but I would like to know whether parsing the above input is feasible, and how. Below is a minimal (non-working) example using Flex/Bison.

lex.l:

%{
#include "y.tab.h"
%}

%option noyywrap
%option yylineno

%%
[a-zA-Z][a-zA-Z0-9_]*   { return ID; }
@                       { return OP; }
[ \t\r\n]+              ; /* ignore whitespace */
.                       { return ERROR; } /* any other character causes parse error */
%%

yacc.y:

%{
#include <stdio.h>

extern int yylineno;
void yyerror(const char *str);
int yylex();
%}

%define parse.lac full
%define parse.error verbose

%token ID OP ERROR
%left OP

%start opdefs

%%
opright:
       | opright ID
       ;

opdef: ID OP ID opright
     ;

opdefs:
      | opdefs opdef
      ;
%%

void yyerror(const char *str) {
    fprintf(stderr, "error@%d: %s\n", yylineno, str);
}

int main(int argc, char *argv[]) {
    yyparse();
}

Build with: $ flex lex.l && yacc -d yacc.y --report=all --verbose && gcc lex.yy.c y.tab.c

The issue: I cannot stop the parser from including the next lvalue identifier in the rvalue of the first operation.

$ ./a.out
A @ B C X @ Y
error@1: syntax error, unexpected OP, expecting $end or ID

The above is always parsed as: reduce(A @ B reduce(C X)) @ Y

I get the feeling I have to somehow put a condition on the lookahead token that says that if it is the operator, the last identifier should not be shifted and the current stack should be reduced:

A @ B C X @ Y
        ^ *    // ^: current, *: lookahead
-> reduce 'A @ B C' !
-> shift 'X' !

I tried all kinds of operator precedence arrangements but cannot get it to work.

I would also be willing to accept a solution that does not use Bison.

Answer 1:

A naïve grammar for that language is LALR(2), and bison does not generate LALR(2) parsers.

Any LALR(2) grammar can be mechanically modified to produce an LALR(1) grammar with a compatible parse tree, but I don't know of any automatic tool which does that.
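To see where the second token of lookahead comes in, here is a minimal hand-written recognizer in C for the same language. It is only a sketch, not Bison output: the token kinds and the function name count_opdefs are made up for illustration. The inner loop has to inspect both the current token and the one after it before it can decide whether an ID still belongs to the current rvalue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch (not Bison's generated ones). */
enum tok { T_ID, T_OP, T_END };

/*
 * Recognize (ID OP ID+)* over an array of tokens terminated by T_END.
 * Returns the number of operations found, or -1 on a syntax error.
 */
int count_opdefs(const enum tok *t)
{
    size_t i = 0;
    int count = 0;

    assert(t != NULL);
    while (t[i] != T_END) {
        /* Every operation starts with: ID OP ID */
        if (t[i] != T_ID || t[i + 1] != T_OP || t[i + 2] != T_ID)
            return -1;
        i += 3;

        /* Absorb further rvalue IDs, but stop at an ID that is followed
         * by OP: that ID is the lvalue of the next operation. This test
         * needs two tokens, t[i] and t[i + 1]. */
        while (t[i] == T_ID && t[i + 1] != T_OP)
            i++;

        count++;
    }
    return count;
}
```

An LALR(1) parser only ever sees the equivalent of t[i] when it has to make that decision, which is why the grammar has to be rearranged or the lexer has to help.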

The transformation can be done by hand, although it is tedious, and you will need to adjust the actions in order to recover the correct parse tree:

%{
  typedef struct IdList  { char* id; struct IdList* next; } IdList;
  typedef struct Def     { char* lhs; IdList* rhs; } Def;
  typedef struct DefList { Def* def; struct DefList* next; } DefList;
%}
%union {
  Def*     def;
  DefList* defs;
  char*    id;
}
%type <def>  ophead
%type <defs> opdefs
%token <id>   ID

%%

prog  : opdefs        { $1->def->rhs = IdList_reverse($1->def->rhs);
                        DefList_show(DefList_reverse($1)); }
ophead: ID '@' ID     { $$ = Def_new($1);
                        $$->rhs = IdList_push($$->rhs, $3); } 
opdefs: ophead        { $$ = DefList_push(NULL, $1); }
      | opdefs ID     { $1->def->rhs = IdList_push($1->def->rhs, $2);
                        $$ = $1; }
      | opdefs ophead { $1->def->rhs = IdList_reverse($1->def->rhs);
                        $$ = DefList_push($1, $2); }
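The list helpers used in the actions are not shown here. A minimal sketch of IdList_push and IdList_reverse, assuming a singly linked list that pushes at the head (which is why the actions reverse the list once a def is complete), could look like this. The bodies are my assumption; only the names and the struct layout come from the grammar above.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct IdList { char *id; struct IdList *next; } IdList;

/* Prepend id to list and return the new head (O(1) per identifier).
 * The node stores the pointer as-is; it does not copy the string. */
IdList *IdList_push(IdList *list, char *id)
{
    IdList *node = malloc(sizeof *node);
    assert(node && "IdList_push: allocation failed");
    node->id = id;
    node->next = list;
    return node;
}

/* Reverse the list in place and return the new head. */
IdList *IdList_reverse(IdList *list)
{
    IdList *prev = NULL;
    while (list) {
        IdList *next = list->next;
        list->next = prev;
        prev = list;
        list = next;
    }
    return prev;
}
```

DefList_push and DefList_reverse would follow the same pattern, and Def_new is assumed to return a Def whose rhs is NULL.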

This precise problem is, ironically, part of bison itself, because productions do not require a ; terminator. Bison uses itself to generate its parser, and it solves this problem in the lexer rather than jumping through the hoops outlined above. In the lexer, once an ID is found, the scan continues up to the next non-whitespace character. If that is a :, then the lexer returns an identifier-definition token; otherwise, the non-whitespace character is returned to the input stream, and an ordinary identifier token is returned.

Here's one way of implementing that in the lexer:

%x SEEK_AT
%%
  /* See below for explanation, if needed */
  static int deferred_eof = 0;
  if (deferred_eof) { deferred_eof = 0; return 0; }
[[:alpha:]][[:alnum:]_]*  yylval.id = strdup(yytext); BEGIN(SEEK_AT);
[[:space:]]+              ;                /* ignore whitespace */
   /* Could be other rules here */
.                         return *yytext;  /* Let the parser handle errors */

<SEEK_AT>{
  [[:space:]]+            ;                /* ignore whitespace */
  "@"                     BEGIN(INITIAL); return ID_AT;
  .                       BEGIN(INITIAL); yyless(0); return ID;
  <<EOF>>                 BEGIN(INITIAL); deferred_eof = 1; return ID;
}

In the SEEK_AT start condition, we're only interested in @. If we find one, then the ID was the start of a def, and we return the correct token type. If we find anything else (other than whitespace), we return the character to the input stream using yyless, and return the ID token type. Note that yylval was already set from the initial scan of the ID, so there is no need to worry about it here.
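If it helps to see the same idea without flex, here is a rough hand-written equivalent in C that scans a NUL-terminated string. It is a sketch under my own assumptions: the token names and next_token are hypothetical, and the identifier text is not returned, only the token kind.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch. */
enum tok { TOK_EOF = 0, TOK_ID, TOK_ID_AT, TOK_ERROR };

/*
 * Scan one token starting at *pp and advance *pp past it.
 * An identifier followed (after optional whitespace) by '@' is returned
 * as TOK_ID_AT, and the '@' is consumed along with it.
 */
int next_token(const char **pp)
{
    const char *p;
    const char *q;

    assert(pp != NULL && *pp != NULL);
    p = *pp;

    while (isspace((unsigned char)*p))          /* ignore whitespace */
        p++;
    if (*p == '\0') { *pp = p; return TOK_EOF; }
    if (!isalpha((unsigned char)*p)) { *pp = p + 1; return TOK_ERROR; }

    while (isalnum((unsigned char)*p) || *p == '_')
        p++;                                    /* end of the identifier */

    q = p;                                      /* look ahead for '@' */
    while (isspace((unsigned char)*q))
        q++;
    if (*q == '@') { *pp = q + 1; return TOK_ID_AT; }

    *pp = p;    /* leave whatever follows unread, like yyless(0) */
    return TOK_ID;
}
```

On the input A @ B C X @ Y this yields ID_AT, ID, ID, ID_AT, ID and then end of input, so the grammar can become a plain LALR(1) one in which each def starts with ID_AT. Because the terminating NUL of a string can be read as often as needed, this version has no special end-of-input case.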

The only complicated bit of the above code is the EOF handling. Once an EOF has been detected, it cannot be reinserted into the input stream, either with yyless or with unput. Nor is it legal to let the scanner read the EOF again. So it needs to be fully dealt with. Unfortunately, in the SEEK_AT start condition, fully dealing with EOF requires sending two tokens: first the already detected ID token, and then the 0 which yyparse will recognize as end of input. Without a push-parser, we cannot send two tokens from a single scanner action, so we need to register the fact of having received an EOF, and check for that on the next call to the scanner.

Indented code before the first rule is inserted at the top of the yylex function, so it can declare local variables and do whatever needs to be done before the scan starts. As written, this lexer is not re-entrant, but it is restartable because the persistent state is reset in the if (deferred_eof) action. To make it re-entrant, you'd only need to keep deferred_eof in the scanner's per-instance data (yyextra in a reentrant flex scanner) instead of making it a static local.

Answer 2:

Following rici's useful comment and answer, here is what I came up with:

lex.l:

%{
#include <string.h>   /* for strdup */
#include "y.tab.h"
%}

%option noyywrap
%option yylineno

%%
[a-zA-Z][a-zA-Z0-9_]*   { yylval.a = strdup(yytext); return ID; }
@                       { return OP; }
[ \t\r\n]+              ; /* ignore whitespace */
.                       { return ERROR; } /* any other character causes parse error */
%%

yacc.y:

%{
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

extern int yylineno;
void yyerror(const char *str);
int yylex();


#define STR_OP    " @ "
#define STR_SPACE " "

char *concat3(const char *, const char *, const char *);

struct oplist {
    char **ops;
    size_t capacity, count;
} my_oplist = { NULL, 0, 0 };

int oplist_append(struct oplist *, char *);
void oplist_clear(struct oplist *);
void oplist_dump(struct oplist *);
%}

%union {
    char *a;
}

%define parse.lac full
%define parse.error verbose

%token ID OP END ERROR

%start input

%%

opbase: ID OP ID {
         char *s = concat3($<a>1, STR_OP, $<a>3);
         free($<a>1);
         free($<a>3);
         assert(s && "opbase: allocation failed");

         $<a>$ = s;
     }
     ;

ops: opbase {
       $<a>$ = $<a>1;
   }
   | ops opbase {
       int r = oplist_append(&my_oplist, $<a>1);
       assert(r == 0 && "ops: allocation failed");

       $<a>$ = $<a>2;
   }
   | ops ID {
       char *s = concat3($<a>1, STR_SPACE, $<a>2);
       free($<a>1);
       free($<a>2);
       assert(s && "ops: allocation failed");

       $<a>$ = s;
   }
   ;

input: ops {
         int r = oplist_append(&my_oplist, $<a>1);
         assert(r == 0 && "input: allocation failed");
     }
     ;       
%%

char *concat3(const char *s1, const char *s2, const char *s3) {
    size_t len = strlen(s1) + strlen(s2) + strlen(s3);
    char *s = malloc(len + 1);
    if (!s)
        goto concat3__end;

    sprintf(s, "%s%s%s", s1, s2, s3);

concat3__end:
    return s;
}


int oplist_append(struct oplist *oplist, char *op) {
    if (oplist->count == oplist->capacity) {  
        char **ops = realloc(oplist->ops, (oplist->capacity + 32) * sizeof(char *));
        if (!ops)
            return 1;

        oplist->ops = ops;
        oplist->capacity += 32;
    } 

    oplist->ops[oplist->count++] = op;
    return 0;
}

void oplist_clear(struct oplist *oplist) {
    if (oplist->count > 0) {
        for (size_t i = 0; i < oplist->count; ++i)
            free(oplist->ops[i]);
        oplist->count = 0;
    }

    if (oplist->capacity > 0) {
        free(oplist->ops);
        oplist->capacity = 0;
    }
}

void oplist_dump(struct oplist *oplist) {
    for (size_t i = 0; i < oplist->count; ++i)
        printf("%2zu: '%s'\n", i, oplist->ops[i]);
}


void yyerror(const char *str) {
    fprintf(stderr, "error@%d: %s\n", yylineno, str);
}

int main(int argc, char *argv[]) {
    yyparse();

    oplist_dump(&my_oplist);
    oplist_clear(&my_oplist);
}

Output with A @ B C X @ Y:

 0: 'A @ B C'
 1: 'X @ Y'

Source: https://stackoverflow.com/questions/35431147/how-to-reduce-parser-stack-or-unshift-the-current-token-depending-on-what-foll

Tags

parsing

compiler-construction

grammar

bison