Parse line data until keyword with pyparsing

落花浮王杯 submitted on 2019-12-12 08:57:03

Question


I'm trying to parse line data and then group the lines into lists.

Here is my script:

from pyparsing import *

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

EOL = LineEnd().suppress()
start = Keyword('START').suppress() + EOL
end = Keyword('END').suppress() + EOL

line = SkipTo(LineEnd()) + EOL
lines = start + OneOrMore(start | end | Group(line))

start.setDebug()
end.setDebug()
line.setDebug()

result = lines.parseString(data)
results_list = result.asList()

print(results_list)

This code was inspired by another Stack Overflow question: Matching nonempty lines with pyparsing

What I need is to parse everything from START to END line by line and save it to a list per group (everything from START to the matching END is one group). However, this script puts every line in a new group.

This is the result:

[['line 2'], ['line 3'], ['line 4'], ['line a'], ['line b'], ['line c'], ['']]

And I want it to be:

[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]

It also parses an empty string at the end.

I'm a pyparsing beginner, so I'd appreciate your help.

Thanks


Answer 1:


You could use nestedExpr to find the text delimited by START and END.

If you use

In [322]: pp.nestedExpr('START', 'END').searchString(data).asList()
Out[322]: 
[[['line', '2', 'line', '3', 'line', '4']],
 [['line', 'a', 'line', 'b', 'line', 'c']]]

then the text is split on whitespace (notice 'line', '2' above, where we want 'line 2' instead). We would rather split only on '\n'. To fix this, we can use the content parameter of pp.nestedExpr, which controls what is considered an item inside the nested list. The source code for nestedExpr defines

content = (Combine(OneOrMore(~ignoreExpr + 
                ~Literal(opener) + ~Literal(closer) +
                CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS,exact=1))
            ).setParseAction(lambda t:t[0].strip()))

by default, where pp.ParserElement.DEFAULT_WHITE_CHARS is

In [324]: pp.ParserElement.DEFAULT_WHITE_CHARS
Out[324]: ' \n\t\r'

This is what causes nestedExpr to split on all whitespace. If we reduce that to just '\n', nestedExpr splits the content by lines instead of by all whitespace.


import pyparsing as pp

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

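opener = 'START'
closer = 'END'
# Treat each run of non-newline characters as one item, so a whole line
# becomes a single token. The negative lookaheads stop an item wherever
# START or END begins.
content = pp.Combine(pp.OneOrMore(~pp.Literal(opener)
                                  + ~pp.Literal(closer)
                                  + pp.CharsNotIn('\n', exact=1)))
expr = pp.nestedExpr(opener, closer, content=content)

# searchString wraps each match in an extra list; item[0] unwraps it.
result = [item[0] for item in expr.searchString(data).asList()]
print(result)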

yields

[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
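
As an alternative sketch (not part of the answer above), the grammar from the question can probably also be adjusted directly. The idea is to wrap a whole START ... END block in one Group instead of grouping each line, and to make the line expression refuse to match when the END keyword comes next. The names block and blocks are introduced here only for illustration, and the sketch assumes every block is properly closed by END.

```python
import pyparsing as pp

data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

EOL = pp.LineEnd().suppress()
start = pp.Keyword('START').suppress() + EOL
end = pp.Keyword('END').suppress() + EOL

# A content line: anything up to the end of the line, unless the line
# starts with the END keyword (negative lookahead).
line = ~pp.Keyword('END') + pp.SkipTo(pp.LineEnd()) + EOL

# One Group per START ... END block, rather than one Group per line.
block = pp.Group(start + pp.OneOrMore(line) + end)
blocks = pp.OneOrMore(block)

result = blocks.parseString(data).asList()
print(result)
```

Because parsing stops after the last END, the line expression is never tried at the end of the input, which should also avoid the trailing empty string seen in the question.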


Source: https://stackoverflow.com/questions/28731655/parse-line-data-until-keyword-with-pyparsing
