Are GCC and Clang parsers really handwritten?

后端未结

关注

 6  1496

独厮守ぢ 2020-11-29 15:15

It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine generated, Bison-Flex based, bottom up pa

6条回答

萌比男神i (楼主)

2020-11-29 16:14

Weird answers there!

C/C++ grammars aren't context free. They are context sensitive because of the Foo * bar; ambiguity. We have to build a list of typedefs to know if Foo is a type or not.

Ira Baxter: I don't see the point with your GLR thing. Why build a parse tree which comprises ambiguities. Parsing means solving ambiguities, building the syntax tree. You resolve these ambiguities in a second pass, so this isn't less ugly. For me it is far more ugly ...

Yacc is a LR(1) parser generator (or LALR(1)), but it can be easily modified to be context sensitive. And there is nothing ugly in it. Yacc/Bison has been created to help in parsing C language, so probably it isn't the ugliest tool to generate a C parser ...

Until GCC 3.x the C parser is generated by yacc/bison, with typedefs table built during parsing. With "in parse" typedefs table building, C grammar becomes locally context free and furthermore "locally LR(1)".

Now, in Gcc 4.x, it is a recursive descent parser. It is exactly the same parser as in Gcc 3.x, it is still LR(1), and has the same grammar rules. The difference is that the yacc parser has been hand rewritten, the shift/reduce are now hidden in the call stack, and there is no "state454 : if (nextsym == '(') goto state398" as in gcc 3.x yacc's parser, so it is easier to patch, handle errors and print nicer messages, and to perform some of the next compiling steps during parsing. At the price of much less "easy to read" code for a gcc noob.

Why did they switched from yacc to recursive descent? Because it is quite necessary to avoid yacc to parse C++, and because GCC dreams to be multi language compiler, i.e. sharing maximum of code between the different languages it can compile. This is why the C++ and the C parser are written in the same way.

C++ is harder to parse than C because it isn't "locally" LR(1) as C, it is not even LR(k). Look at func<4 > 2> which is a template function instantiated with 4 > 2, i.e. func<4 > 2> has to be read as func<1>. This is definitely not LR(1). Now consider, func<4 > 2 > 1 > 3 > 3 > 8 > 9 > 8 > 7 > 8>. This is where a recursive descent can easily solve ambiguity, at the price of a few more function calls (parse_template_parameter is the ambiguous parser function. If parse_template_parameter(17tokens) failed, try again parse_template_parameter(15tokens), parse_template_parameter(13tokens) ... until it works).

I don't know why it wouldn't be possible to add into yacc/bison recursive sub grammars, maybe this will be the next step in gcc/GNU parser development?

0 讨论(0)

查看其它6个回答
发布评论:

提交评论
- 加载中...