Troubles using Bison's recursive rules, and storing values using it

Question

I am trying to write a flex+bison scanner and parser for trees in the Newick file format, so that I can run operations on them. The grammar I implemented is a simplification of this example (labels and lengths are always of the same type, returned by flex).

This is essentially a parser for a file format that represents a tree as a series of (recursive) subtrees and/or leaves. The main tree always ends with ; and it, like every subtree inside it, contains a series of nodes between ( and ). A name and a length may follow the closing parenthesis, written as name and :length. Both are optional: you can omit them, give only one of them (name or :length), or give both as name:length.

If a node lacks a name or a length, default values are applied (for example, 'missingName' and 1).

Two examples: (child1:4, child2:6)root:6; and ((child1Of1:2, child2Of1:9)child1:5, child2:6)root:6;

The implementation of said grammar is the following one (NOTE: I translated my own code, as it was in my language, and lots of side stuff got removed for clarity):

%union {
    struct node {
        char* name; /*the node's assigned name, either from the file or from default values*/
        float length; /*node's length*/
    } dataOfNode;
}

 

%start tree
 
%token<dataOfNode> OP CP COMMA SEMICOLON COLON DISTANCE NAME

%type<dataOfNode> tree subtrees recursive_subtrees subtree leaf


%%

tree:   subtrees NAME COLON DISTANCE SEMICOLON          {} // with name and distance    
        | subtrees NAME SEMICOLON                       {} // without distance
        | subtrees COLON DISTANCE SEMICOLON             {} // without name
        | subtrees SEMICOLON                            {} // without name nor distance
        ;
    
    
subtrees:   OP recursive_subtrees CP                    {}
            ;   
        
            
recursive_subtrees: subtree                             {} // just one subtree, or the last one of the list         
                    | recursive_subtrees COMMA subtree  {} // (subtree, subtree, subtree...)        
    

subtree:    subtrees NAME COLON DISTANCE    {   $$.name= $2.name; $$.length = $4.length; $$.lengthAcum = $$.lengthAcum + $4.length;
                                                } // group of subtrees, same as the main tree but without ";" at the end, with name and distance                                        
            
            | subtrees NAME                 {   $$.name= $2.name; $$.length = 1.0;}             // without distance                                 
            
            | subtrees COLON DISTANCE       {   $$.name= "missingName"; $$.length = $3.length;} // without name                             
           
            | subtrees                      {   $$.name= "missingName"; $$.length = 1.0;}       // without name nor distance                            
            
            | leaf                          {   $$.name= $1.name; $$.length = $1.length;}       // a leaf
                    
                    
                    
leaf:   NAME COLON DISTANCE {   $$.name= $1.name; $$.length = $3.length;}       // with name and distance
        | NAME              {   $$.name= $1.name; $$.length = 1.0;}             // without distance
        | COLON DISTANCE    {   $$.name= "missingName"; $$.length = $2.length;} // without name
        |                   {   $$.name= "missingName"; $$.length = 1.0;}       // without name nor distance
        ;


%%

Now, let's say that I want to distinguish who is the parent of each subtree and leaf, so that I can accumulate the length of a parent subtree with the length of the "longest" child, recursively.

I do not know if I chose a bad design for this, but I can't get past assigning names and lengths to each subtree (and leaf, which is also considered a subtree). I also don't think I understand how recursion works well enough to identify the parents during the matching process.

Answer 1:

This is mostly a matter of defining the data structure that will hold your trees and building it "bottom up" in the actions of the rules. The "bottom up" part is an important implication of the way bison parsers work: they recognize constructs starting from the leaves of the grammar and then assemble them into higher non-terminals (and ultimately into the start non-terminal, whose action is the last one run). So by the time a parent's action runs, all of its children have already been built and are available as $1, $2, and so on. You can also simplify things by not having so many redundant rules. Finally, IMO it's always better to use character literals for single-character tokens rather than names. So you might end up with:

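%{
#include <assert.h>   /* assert, used in append_tree */
#include <stdlib.h>   /* malloc, used in new_leaf */

struct tree {
    struct tree  *next;     /* each tree is potentially part of a list of (sub)trees */
    struct tree  *subtree;  /* and may contain a list of subtrees */
    const char   *name;
    double       length;
};

struct tree *new_leaf(const char *name, double length);   /* malloc a new leaf "tree" */
void append_tree(struct tree **list, struct tree *t);  /* append a tree on to a list of trees */
void print_tree(struct tree *t);  /* called from the tree rule; definition not shown here */
int yylex(void);                  /* provided by the flex scanner */
void yyerror(const char *msg);    /* error reporter that bison expects you to supply */
%}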

%union {
    const char   *name;
    double       value;
    struct tree  *tree;
}

%type<tree> subtrees recursive_subtrees subtree leaf
%token<name> NAME
%token<value> DISTANCE

%%

tree: subtrees leaf ';'  { $2->subtree = $1;  print_tree($2); } 
    ;
    
subtrees: '(' recursive_subtrees ')'  { $$ = $2; }
        ;   
            
recursive_subtrees: subtree                         { $$ = $1; } // just one subtree, or the last one of the list
                  | recursive_subtrees ',' subtree  { append_tree(&($$=$1)->next, $3); } // (subtree, subtree, subtree...)
                  ;

subtree: subtrees leaf  { ($$=$2)->subtree = $1; }
       | leaf           { $$ = $1; }
       ;
                    
leaf: NAME ':' DISTANCE { $$ = new_leaf($1, $3);}              // with name and distance
    | NAME              { $$ = new_leaf($1, 1.0);}             // without distance
    | ':' DISTANCE      { $$ = new_leaf("missingName", $2); }  // without name
    |                   { $$ = new_leaf("missingName", 1.0); } // without name nor distance
    ;

%%

struct tree *new_leaf(const char *name, double length) {
    struct tree *rv = malloc(sizeof(struct tree));
    rv->subtree = rv->next = NULL;
    rv->name = name;
    rv->length = length;
    return rv;
}

void append_tree(struct tree **list, struct tree *t) {
    assert(t->next == NULL);  // must not be part of a list yet
    while (*list) list = &(*list)->next;
    *list = t;
}
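
The question also asks how to add each parent's length to the length of its "longest" child, recursively. Once the parser has built the linked structure above, that becomes an ordinary tree walk. The following is a minimal sketch, not part of the original answer: longest_length, print_node and build_example are hypothetical names, and the body of print_tree is an assumption, since the answer only calls it. build_example constructs the tree by hand in the same order the rule actions would fire for ((child1Of1:2, child2Of1:9)child1:5, child2:6)root:6;

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct tree {
    struct tree *next;     /* next sibling in the parent's list of children */
    struct tree *subtree;  /* first child, or NULL for a leaf */
    const char  *name;
    double       length;
};

/* Same helpers as in the grammar file, repeated so this compiles on its own. */
static struct tree *new_leaf(const char *name, double length) {
    struct tree *rv = malloc(sizeof *rv);
    if (!rv) abort();                      /* out of memory */
    rv->subtree = rv->next = NULL;
    rv->name = name;
    rv->length = length;
    return rv;
}

static void append_tree(struct tree **list, struct tree *t) {
    assert(t->next == NULL);               /* must not be part of a list yet */
    while (*list) list = &(*list)->next;
    *list = t;
}

/* Hypothetical helper: a node's own length plus the largest
 * accumulated length among its children (0 for a leaf). */
static double longest_length(const struct tree *t) {
    double best = 0.0;
    for (const struct tree *c = t->subtree; c; c = c->next) {
        double d = longest_length(c);
        if (d > best) best = d;
    }
    return t->length + best;
}

/* Assumed implementation of print_tree: writes the tree back in Newick form. */
static void print_node(const struct tree *t, FILE *out) {
    if (t->subtree) {
        fputc('(', out);
        for (const struct tree *c = t->subtree; c; c = c->next) {
            print_node(c, out);
            if (c->next) fputc(',', out);
        }
        fputc(')', out);
    }
    fprintf(out, "%s:%g", t->name, t->length);
}

void print_tree(struct tree *t) {
    print_node(t, stdout);
    puts(";");
}

/* Builds the example tree in the order the parser's actions would run. */
static struct tree *build_example(void) {
    struct tree *inner = new_leaf("child1Of1", 2.0);        /* recursive_subtrees: subtree */
    append_tree(&inner->next, new_leaf("child2Of1", 9.0));  /* recursive_subtrees ',' subtree */
    struct tree *child1 = new_leaf("child1", 5.0);          /* leaf after the inner ')' */
    child1->subtree = inner;                                /* subtree: subtrees leaf */
    append_tree(&child1->next, new_leaf("child2", 6.0));    /* second child of root */
    struct tree *root = new_leaf("root", 6.0);              /* leaf before ';' */
    root->subtree = child1;                                 /* tree: subtrees leaf ';' */
    return root;
}
```

In the grammar, the tree rule could call longest_length($2) right after setting $2->subtree. One caveat on the scanner side: new_leaf stores the name pointer as given, so the flex rule for NAME should hand over a copy of yytext (for example via strdup) rather than yytext itself, which flex overwrites on the next match.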

Source: https://stackoverflow.com/questions/65062740/troubles-using-bisons-recursive-rules-and-storing-values-using-it

Tags

grammar

bison

context-free-grammar