parse bibtex with flex+bison: revisited

Posted by 久未见 on 2019-11-28 06:51:39

Question


For the last few weeks, I have been trying to write a parser for BibTeX (http://www.bibtex.org/Format/) files using flex and bison.

$ cat raw.l
%{
#include "raw.tab.h" 
%}
value [\"\{][a-zA-Z0-9 .\t\{\} \"\\]*[\"\}]
%%
[a-zA-Z]*               return(KEY);
\"                          return(QUOTE);
\{                          return(OBRACE);
\}                          return(EBRACE);
;                           return(SEMICOLON);
[ \t]+                  /* ignore whitespace */;
{value}     {
    yylval.sval = malloc(strlen(yytext));
    strncpy(yylval.sval, yytext, strlen(yytext));
    return(VALUE);
}

$ cat raw.y
%{
#include <stdio.h>
%}

//Symbols.
%union
{
 char *sval;
};
%token <sval> VALUE
%token KEY
%token OBRACE
%token EBRACE
%token QUOTE
%token SEMICOLON 

%start Entry
%%

Entry:
     '@'KEY OBRACE VALUE ',' 
     KeyVal
     EBRACE
     ;

KeyVal:
      /* empty */
      | KeyVal '=' VALUE ','
      | KeyVal '=' VALUE 
      ;
%%

int yyerror(char *s) {
  printf("yyerror : %s\n",s);
}

int main(void) {
  yyparse();

}

A sample BibTeX file is:

@Book{a1,
    author = "a {\"m}ook, Rudra Banerjee",
    Title="ASR",
    Publisher="oxf",
    Year="2010",
    Add="UK",
    Edition="1",
}
@Article{a2,
    Author="Rudra Banerjee",
    Title="Fe{\"Ni}Mo",
    Publisher={P{\"R}B},
    Issue="12",
    Page="36690",
    Year="2011",
    Add="UK",
    Edition="1",
}

When I try to parse it, it gives a syntax error. With GDB, it seems to expect the field names in KEY to be declared (probably):

Reading symbols from /home/rudra/Programs/lex/Parsing/a.out...done.
(gdb) Undefined command: "".  Try "help".
(gdb) Undefined command: "Author".  Try "help".
(gdb) Undefined command: "Editor".  Try "help".
(gdb) Undefined command: "Title".  Try "help".
.....

I would be grateful if someone could help me with this.


Answer 1:


Lots of problems. First, your lexer is confused: it tries to recognize quoted strings and braced things as a single VALUE while also trying to recognize single characters like " and {. For quotes, it makes sense to have the lexer recognize the whole string, but for structural things that you want to parse (like braced lists), you need to return single tokens for the parser to parse. Second, when allocating space for a string, you aren't allocating space for the NUL terminator: malloc(strlen(yytext)) is one byte too short, and strncpy with that same length never writes the terminating NUL. Finally, your grammar looks odd: it wants to parse things like = VALUE = VALUE as a KeyVal, which doesn't correspond to anything in a BibTeX file.
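To make the second point concrete, here is a minimal sketch of a copy that does reserve room for the terminator. The helper name copy_token is an assumption made for illustration; it is not part of flex or of the original code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of flex): copy a token of known length
 * into freshly allocated memory, reserving one extra byte for the NUL. */
char *copy_token(const char *text, size_t len)
{
    char *copy = malloc(len + 1);   /* +1 for the terminating '\0' */
    if (copy == NULL)
        return NULL;                /* caller must handle allocation failure */
    memcpy(copy, text, len);        /* copy exactly len bytes of the token */
    copy[len] = '\0';               /* the step the original action left out */
    return copy;
}
```

In a flex action this could be called as yylval.sval = copy_token(yytext, yyleng);. Where strdup is available (it is POSIX), strdup(yytext) does the same job in one call.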

So first, for the lexer. You want to recognize quoted strings and identifiers, but other things should be single characters:

[A-Za-z][A-Za-z0-9]*      { yylval.sval = strdup(yytext); return KEY; }
\"([^"\\]|\\.)*\"         { yylval.sval = strdup(yytext); return VALUE; }
[ \t\n]                   ; /* ignore whitespace */
[{}@=,]                   { return *yytext; }
.                         { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }
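To see what the quoted-string rule accepts, here is a hand-written C sketch of the same idea: an opening quote, then any mix of ordinary characters and backslash escapes, then a closing quote. The function name scan_quoted is hypothetical, and this is only an illustration of the pattern \"([^"\\]|\\.)*\", not what flex generates.

```c
#include <stddef.h>

/* Hypothetical helper: return the length of the quoted string starting at s
 * (both quotes included), or 0 if s does not begin with a complete one. */
size_t scan_quoted(const char *s)
{
    size_t i = 1;

    if (s[0] != '"')
        return 0;                   /* must start with an opening quote */
    while (s[i] != '\0') {
        if (s[i] == '\\') {
            /* \\. : a backslash plus any one character except newline */
            if (s[i + 1] == '\0' || s[i + 1] == '\n')
                return 0;
            i += 2;
        } else if (s[i] == '"') {
            return i + 1;           /* closing quote found */
        } else {
            i++;                    /* [^"\\] : any ordinary character */
        }
    }
    return 0;                       /* ran out of input before the closing quote */
}
```

Because \" is consumed as an escape, a field such as "a {\"m}ook, Rudra Banerjee" is matched as one string instead of ending at the inner quote.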

Now you need a parser for the entries:

Input: /* empty */ | Input Entry ;  /* input is zero or more entries */
Entry: '@' KEY '{' KEY ',' KeyVals '}' ;
KeyVals: /* empty */ | KeyVals KeyVal ; /* zero or more keyvals */
KeyVal: KEY '=' VALUE ',' ;

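That should parse the quoted-string fields in the example you give. A braced value such as Publisher={P{\"R}B} is not a VALUE under these rules, so it would still need a rule of its own.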
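As a way to check what the four grammar rules accept, here is a small hand-written recursive-descent recognizer in C, with one function per rule. This is a swapped-in technique for illustration, not the LALR parser bison would generate, and it reads a string directly instead of taking tokens from flex. All function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Each helper takes a position (or NULL after an earlier failure) and
 * returns the position just past what it matched, or NULL on failure. */

static const char *skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;
    return p;
}

/* A single literal character such as '@', '=' or ','. */
static const char *lit(const char *p, char c)
{
    if (p == NULL)
        return NULL;
    p = skip_ws(p);
    return *p == c ? p + 1 : NULL;
}

/* KEY: [A-Za-z][A-Za-z0-9]* */
static const char *key(const char *p)
{
    if (p == NULL)
        return NULL;
    p = skip_ws(p);
    if (!isalpha((unsigned char)*p))
        return NULL;
    while (isalnum((unsigned char)*p))
        p++;
    return p;
}

/* VALUE: a double-quoted string with backslash escapes. */
static const char *value(const char *p)
{
    if (p == NULL)
        return NULL;
    p = skip_ws(p);
    if (*p != '"')
        return NULL;
    for (p++; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0')
            p++;                        /* skip the escaped character */
        else if (*p == '"')
            return p + 1;
    }
    return NULL;                        /* unterminated string */
}

/* KeyVal: KEY '=' VALUE ',' */
static const char *keyval(const char *p)
{
    return lit(value(lit(key(p), '=')), ',');
}

/* Entry: '@' KEY '{' KEY ',' KeyVals '}' */
static const char *entry(const char *p)
{
    const char *next;

    p = lit(key(lit(key(lit(p, '@')), '{')), ',');
    if (p == NULL)
        return NULL;
    while ((next = keyval(p)) != NULL)  /* KeyVals: zero or more KeyVal */
        p = next;
    return lit(p, '}');
}

/* Input: zero or more entries. Returns 1 if the whole string parses. */
int parse_input(const char *p)
{
    while (p != NULL) {
        p = skip_ws(p);
        if (*p == '\0')
            return 1;
        p = entry(p);
    }
    return 0;
}
```

Calling parse_input on an entry whose fields are all quoted strings followed by commas returns 1. An entry containing a braced value, or a field without its trailing comma, returns 0, which matches what the grammar as written would reject.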



Source: https://stackoverflow.com/questions/15305789/parse-bibtex-with-flexbison-revisited
