Using PyParsing to parse a language with significant newlines (like Python)

Question

I am implementing a language in which newlines are sometimes significant, as in Python, and with exactly the same rules.

For the purposes of my question, we can take the fragment of Python that deals with assignments, parentheses, and the treatment of newlines and semicolons.

For example, one could write:

a = 1 + 2 + 3    # ok
b = c

but not

a = 1 + 2 + 3     b = c   # incorrect

because one needs a newline to divide the two statements.

However we can have

a = 1 + 2 + 3;     b = c   # ok

using the semicolon.

Also it is not allowed to have

a = 1 + 2 +   # incorrect
3
b = c

because there cannot be line breaks in a statement.

However, it is possible to have

a = 1 + 2 + (     # ok
3)
b = c

a = 1 + 2 + \     # ok
3
b = c

I have been trying to implement the rules above but I'm stuck.

First, I use

ParserElement.setDefaultWhitespaceChars(' \t')

so that now \n is significant.

I can enforce newlines as a separator using

lines = ZeroOrMore(line + OneOrMore(LineEnd()))

A variation of this allows ; as a separator as well. (I cannot quite handle the continuation backslash \, though.)
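For concreteness, here is a minimal sketch of what such a variation might look like. The names parse_statements, NL and SEMI and the simplified flat expr grammar (no infixNotation, no parentheses) are assumptions made for illustration, not the actual code from my project:

```python
import pyparsing as pp

# Only spaces and tabs are skipped automatically, so "\n" stays significant.
pp.ParserElement.setDefaultWhitespaceChars(" \t")

NL = pp.Suppress(pp.LineEnd())
SEMI = pp.Suppress(";")

var = pp.Word(pp.alphas)
num = pp.Word(pp.nums)
operand = num | var

# Simplified, flat arithmetic expression (illustration only).
expr = operand + pp.ZeroOrMore(pp.oneOf("+ - * /") + operand)
assignment = pp.Group(var + "=" + expr)

# One physical line: assignments separated by ";", optional trailing ";".
line = assignment + pp.ZeroOrMore(SEMI + assignment) + pp.Optional(SEMI)

# Every line must be terminated by at least one line end.
lines = pp.ZeroOrMore(line + pp.OneOrMore(NL))


def parse_statements(text):
    """Return the parsed assignments as nested lists; raise ParseException on bad input."""
    return lines.parseString(text, parseAll=True).asList()
```

With this sketch, "a = 1 + 2 + 3;     b = c" followed by a newline parses as two assignments, while the same text without the semicolon raises a ParseException, as does a statement that is broken across two lines after a trailing +.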

I use infixNotation to define +, -, /, *.

The part I am stuck on is that newlines should be ignored inside parentheses, as in this case:

a = 1 + 2 + ( 
3 +
1)

I think setWhitespaceChars could play a role here, applied to the parenthesized expression (LPAR + term + RPAR) that infixNotation generates. However, that does not work, because the whitespace characters are not inherited by the nested expressions.

Does anybody have any hint?

My question could also be phrased as "how do I parse (a fragment of) Python with pyParsing?". I expected to find an example project, but I didn't. While searching, I have seen people point to the examples in the pyParsing repo; however, parsePythonValue.py is about parsing values (which I can already do) rather than dealing with significant newlines, and pythonGrammarParser.py is about parsing the BNF grammar for Python, not parsing Python itself.

Answer 1:

NOTE: THIS IS NOT A WORKING SOLUTION (at least not yet). IT RELIES ON UNRELEASED CHANGES TO PYPARSING, WHICH DON'T EVEN PASS ALL UNIT TESTS YET. I AM POSTING IT JUST AS A WAY TO DESCRIBE A POSSIBLE APPROACH TO A SOLUTION.

Ooof! This was a lot more difficult than I thought it should be. To implement, I used pyparsing's ignore mechanism with parse actions attached to the lpar and rpar expressions to ignore <NL>'s inside parens, but not outside. This also required adding the ability to clear the ignoreExprs list by calling expr.ignore(None). Here is how your code might look:

import pyparsing as pp

# works with and without packrat
pp.ParserElement.enablePackrat()

pp.ParserElement.setDefaultWhitespaceChars(' \t')

operand = pp.Word(pp.nums)
var = pp.Word(pp.alphas)

arith_expr = pp.Forward()
arith_expr.ignore(pp.pythonStyleComment)
lpar = pp.Suppress("(")
rpar = pp.Suppress(")")

# code to implement selective ignore of NL's inside ()'s
NL = pp.Suppress("\n")
base_ignore = arith_expr.ignoreExprs[:]
ignore_stack = base_ignore[:]
def lpar_pa():
    ignore_stack.append(NL)
    arith_expr.ignore(NL)
    #~ print('post-push', arith_expr.ignoreExprs)
def rpar_pa():
    ignore_stack.pop(-1)
    arith_expr.ignore(None)
    for e in ignore_stack:
        arith_expr.ignore(e)
    #~ print('post-pop', arith_expr.ignoreExprs)
def reset_stack(*args):
    arith_expr.ignore(None)
    for e in base_ignore:
        arith_expr.ignore(e)
    #~ print('post-reset', arith_expr.ignoreExprs)
lpar.addParseAction(lpar_pa)
rpar.addParseAction(rpar_pa)
arith_expr.setFailAction(reset_stack)
arith_expr.addParseAction(reset_stack)

# now define the infix notation as usual
arith_expr <<= pp.infixNotation(operand | var,
    [
    ("-", 1, pp.opAssoc.RIGHT),
    (pp.oneOf("* /"), 2, pp.opAssoc.LEFT),
    (pp.oneOf("- +"), 2, pp.opAssoc.LEFT),
    ],
    lpar=lpar, rpar=rpar
    )

assignment = var + '=' + arith_expr

# Try it out!
assignment.runTests([
"""a = 1 + 3""",
"""a = (1 + 3)""",
"""a = 1 + 2 + ( 
3 +
1)""",
"""a = 1 + 2 + (( 
3 +
1))""",
"""a = 1 + 2 +   
3""",
], fullDump=False)

Prints:

a = 1 + 3
['a', '=', ['1', '+', '3']]
a = (1 + 3)
['a', '=', ['1', '+', '3']]
a = 1 + 2 + ( 
3 +
1)
['a', '=', ['1', '+', '2', '+', ['3', '+', '1']]]
a = 1 + 2 + (( 
3 +
1))
['a', '=', ['1', '+', '2', '+', ['3', '+', '1']]]
a = 1 + 2 +   
3
a = 1 + 2 +   
          ^
FAIL: Expected end of text, found '+'  (at char 10), (line:1, col:11)

So it is not out of the realm of possibility, but it does take some heroic efforts.
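A different route, which is not the approach shown above and does not depend on unreleased pyparsing changes, would be to join physical lines into logical lines before handing them to the grammar, much as Python's own tokenizer does. The following is only a sketch under stated assumptions: the function name join_logical_lines is hypothetical, string literals are not handled (a bracket, # or backslash inside a string would be misread), and runs of whitespace are collapsed to a single space.

```python
def join_logical_lines(source):
    """Join physical lines into logical lines (sketch; ignores string literals).

    A newline is treated as a space when it occurs inside (), [] or {},
    or when it immediately follows a backslash. Comments are dropped.
    """
    logical = []   # finished logical lines
    current = []   # characters of the logical line being built
    depth = 0      # current bracket nesting depth

    def flush():
        # Collapse whitespace and keep only non-empty logical lines.
        text = " ".join("".join(current).split())
        if text:
            logical.append(text)
        current.clear()

    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == "#":
            # Skip the comment up to (not including) the end of the line.
            while i < n and source[i] != "\n":
                i += 1
            continue
        if ch == "\\" and i + 1 < n and source[i + 1] == "\n":
            # Explicit continuation: backslash directly before the newline.
            current.append(" ")
            i += 2
            continue
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth = max(depth - 1, 0)
        if ch == "\n":
            if depth > 0:
                current.append(" ")   # implicit continuation inside brackets
            else:
                flush()               # end of a logical line
        else:
            current.append(ch)
        i += 1
    flush()
    return logical
```

Each returned string could then be parsed as a single statement by a grammar that treats all whitespace alike. A statement broken after a trailing + with no bracket or backslash still comes out as two logical lines, so the grammar rejects the first one. Note that, as in Python itself, the backslash only counts as a continuation when it is the last character on the line.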

Source: https://stackoverflow.com/questions/61169403/using-pyparsing-to-parse-language-with-signficant-newlines-like-python

Tags

python

parsing

pyparsing