LL top-down parser, from CST to AST

Question

I am currently learning about syntax analysis, and more specifically, top-down parsing.

I know the terminology and how it differs from bottom-up LR parsing, and since top-down LL parsers are easier to implement by hand, I would like to write my own.

I have seen two kinds of approaches:

The recursive-descent one using a collection of recursive functions.
The stack-based, table-driven automaton, as shown on Wikipedia.

I am more interested in the latter, for its power and because it eliminates call-stack recursion. However, I don't understand how to build the AST from the implicit parse tree.

This code example of a stack-based automaton shows the parser analyzing the input buffer, but it only gives a yes/no answer saying whether the input was accepted.

I have heard of stack annotations as a way to build the AST, but I can't figure out how to implement them. Can someone provide a practical implementation of this technique?

Answer 1:

You need to understand the underlying concept, the pushdown automaton. Once you understand how to carry out the computation on paper with a pencil, you will be able to understand the several ways of implementing the idea, whether via recursive descent or with an explicit stack. The ideas are the same: when you use recursive descent, you implicitly use the stack that the program itself uses for execution, so the execution data is combined with the parsing automaton's data.

I suggest you start with the course taught by Ullman (automata) or with Dick Grune's, which is the best one focused on parsing. For Grune's book, look for the 2nd edition.

For LR parsing, it is essential to understand the item-based ideas that Earley's algorithm shares with the LR method devised by Don Knuth.

For LL parsing, Grune's book is excellent, and Ullman presents the computation on paper, the mathematical background of parsing that is essential to know if you want to implement your own parsers.

Concerning the AST, it is the output of parsing. A parser either generates a parse tree that is then transformed into an AST, or generates and outputs the AST directly.
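To make the first of these two options concrete, here is a minimal sketch in Python of turning a parse tree (CST) into an AST. Everything in it is an assumption made for illustration: the grammar (E -> T E', E' -> + T E' | ε, T -> num | ( E )), the tuple representation of nodes, and the function name cst_to_ast.

```python
# Hypothetical representation: a CST node is (label, [children]);
# leaves are plain values such as 1 or "+". An AST node is (op, left, right).
# Assumed grammar:  E -> T E'   E' -> + T E' | ε   T -> num | ( E )

def cst_to_ast(node):
    """Collapse a CST for the assumed grammar into a left-associative AST."""
    if not isinstance(node, tuple):
        return node                      # leaf token, nothing to simplify
    label, children = node
    if label == "T":
        if len(children) == 1:           # T -> num
            return children[0]
        return cst_to_ast(children[1])   # T -> ( E ): drop the parentheses
    if label == "E":                     # E -> T E'
        ast = cst_to_ast(children[0])
        tail = children[1]
        while tail[1]:                   # walk E' -> + T E' until E' -> ε
            op, operand, tail = tail[1]
            ast = (op, ast, cst_to_ast(operand))
        return ast
    raise ValueError(f"unexpected CST node {label!r}")


# CST of "1 + 2 + 3" under the assumed grammar
cst = ("E", [("T", [1]),
             ("E'", ["+", ("T", [2]),
                     ("E'", ["+", ("T", [3]), ("E'", [])])])])
print(cst_to_ast(cst))  # ('+', ('+', 1, 2), 3)
```

The helper non-terminal E' and the parentheses exist only to make the grammar LL(1), so they disappear from the AST.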

Answer 2:

"Top-down" and "bottom-up" are excellent descriptions of the two parsing strategies, because they describe precisely how the syntax tree would be constructed if it were constructed. (You can also think of it as the traversal order over the implicit parse tree but here we're actually interested in real parse trees.)

It seems clear that there is an advantage to bottom-up tree construction. When it is time to add a node to the tree, you already know what its children are. You can construct the node fully-formed in one (functional) action. All the child information is right there waiting for you, so you can add semantic information to the node based on the semantic information of its children, even using the children in an order other than left-to-right.

By contrast, the top-down parser constructs the node without any children, and then needs to add each child in turn to the already constructed node. That's certainly possible, but it's a bit ugly. Also, the incremental nature of the node constructor means that semantic information attached to the node also needs to be computed incrementally, or deferred until the node is fully constructed.

In many ways, this is similar to the difference between evaluating expressions written in Reverse Polish Notation (RPN) and expressions written in (Forward) Polish Notation. RPN was invented precisely to ease evaluation, which is possible with a simple value stack. Forward Polish expressions can be evaluated too, obviously: the easiest way is to use a recursive evaluator, but in environments where the call stack cannot be relied upon, it is possible to do it using an operator stack, which effectively turns the expression into RPN on the fly.
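As a rough illustration of that comparison, here is a small Python sketch of both evaluators. It assumes integer operands and binary operators only; the names OPS, eval_rpn and eval_prefix are invented for this example.

```python
import operator

# Assumption: every operator is binary and every operand is an integer.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_rpn(tokens):
    """RPN: operands arrive before their operator, so a value stack suffices."""
    values = []
    for tok in tokens:
        if tok in OPS:
            right = values.pop()
            left = values.pop()
            values.append(OPS[tok](left, right))
        else:
            values.append(int(tok))
    return values.pop()

def eval_prefix(tokens):
    """Prefix: operators wait on an operator stack until their operands exist."""
    pending = []  # entries are [operator, number of operands still missing]
    values = []
    for tok in tokens:
        if tok in OPS:
            pending.append([tok, 2])
            continue
        values.append(int(tok))
        # A finished operand may complete one or more waiting operators.
        while pending:
            pending[-1][1] -= 1
            if pending[-1][1] > 0:
                break
            op, _ = pending.pop()
            right = values.pop()
            left = values.pop()
            values.append(OPS[op](left, right))
    return values.pop()

print(eval_rpn("2 3 * 4 -".split()))     # 2
print(eval_prefix("- * 2 3 4".split()))  # 2
```

The operators in eval_prefix are applied in exactly the order in which they appear in the RPN form of the same expression, which is what "turning it into RPN on the fly" amounts to.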

So that's probably the mechanism of choice for building syntax trees from top-down parsers as well. We add a "reduction" marker to the end of every right-hand side. Since the right-hand side is pushed right-to-left and the marker sits at its end, the marker is pushed first, so it is popped only after every symbol of that right-hand side has been processed.

We also need a value stack, to record the AST nodes (or semantic values) being constructed.

In the basic algorithm, we now have one more case. We start by popping the top of the parser stack, and then examine this object:

The top of the parser stack was a terminal. If the current input symbol is the same terminal, we remove the input symbol from the input, and push it (or its semantic value) onto the value stack.
The top of the parser stack was a marker. The associated reduction action is triggered, which will create the new AST node by popping an appropriate number of values from the value stack and combining them into a new AST node which is then pushed onto the value stack. (As a special case, the marker action for the augmented start symbol's unique production S' -> S $ causes the parse to be accepted, returning the (only) value in the value stack as the AST.)
The top of the parser stack was a non-terminal. We then identify the appropriate right-hand side using the current input symbol, and push that right-hand side (right-to-left) onto the parser stack.
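The three cases above can be sketched as follows. This is only a minimal illustration in Python, not a definitive implementation: it assumes a small LL(1) grammar (E -> T E', E' -> + T E' | ε, T -> num | ( E )) with a hand-written parse table, and the names Marker, TABLE, tokenize, fold_left and parse are made up for the example. AST nodes are plain tuples.

```python
import re
from typing import Callable, NamedTuple

class Marker(NamedTuple):
    """Reduction marker: pop `arity` values, push action(*values)."""
    arity: int
    action: Callable

END = "$"
NONTERMINALS = {"E", "E'", "T"}

def fold_left(first, rest):
    """Combine T with the (operator, operand) pairs collected by E'."""
    node = first
    for op, operand in rest:
        node = (op, node, operand)
    return node

# (non-terminal, lookahead) -> (right-hand side, semantic action)
TABLE = {
    ("E", "num"): (["T", "E'"], fold_left),
    ("E", "("):   (["T", "E'"], fold_left),
    ("E'", "+"):  (["+", "T", "E'"], lambda op, t, rest: [(op, t)] + rest),
    ("E'", ")"):  ([], lambda: []),
    ("E'", END):  ([], lambda: []),
    ("T", "num"): (["num"], lambda n: n),
    ("T", "("):   (["(", "E", ")"], lambda _l, e, _r: e),
}

def tokenize(text):
    """Return (kind, value) pairs, terminated by the end marker."""
    tokens = []
    for lexeme in re.findall(r"\d+|\S", text):
        if lexeme.isdigit():
            tokens.append(("num", int(lexeme)))
        else:
            tokens.append((lexeme, lexeme))
    tokens.append((END, END))
    return tokens

def parse(text):
    tokens = tokenize(text)
    pos = 0
    values = []  # value stack of AST nodes / semantic values
    # Augmented start production S' -> E $ whose marker yields the AST.
    stack = [Marker(2, lambda e, _end: e), END, "E"]
    while stack:
        top = stack.pop()
        if isinstance(top, Marker):
            # Reduction: replace the last `arity` values by one new node.
            start = len(values) - top.arity
            args = values[start:]
            del values[start:]
            values.append(top.action(*args))
            continue
        kind, value = tokens[pos]
        if top in NONTERMINALS:
            if (top, kind) not in TABLE:
                raise SyntaxError(f"unexpected {kind!r} while expanding {top}")
            rhs, action = TABLE[(top, kind)]
            stack.append(Marker(len(rhs), action))  # marker goes in first
            stack.extend(reversed(rhs))             # then the RHS, right-to-left
        elif top == kind:
            values.append(value)                    # terminal matched: shift value
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, found {kind!r}")
    return values.pop()

print(parse("1 + (2 + 3) + 4"))  # ('+', ('+', 1, ('+', 2, 3)), 4)
```

Because E' is right-recursive, its action here returns a list of (operator, operand) pairs and the action for E folds that list to the left; that is one way among several to get left-associative nodes out of an LL grammar.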

Source: https://stackoverflow.com/questions/54706455/ll-top-down-parser-from-cst-to-ast

Tags

parsing

abstract-syntax-tree

state-machine