parsing nested structures with pyparsing

Posted by 我的未来我决定 on 2019-12-10 20:46:13

Question


I'm trying to parse a particular syntax for positions in biological sequences. The positions can have forms like:

12           -- a simple position in the sequence
12+34        -- a complex position as a base (12) and offset (+34)
12_56        -- a range, from 12 to 56
12+34_56-78  -- a range from start to end, where either or both may be simple or complex

I'd like to have these parsed as dicts, roughly like this:

12          -> { 'start': { 'base': 12, 'offset': 0 },  'end': None }
12+34       -> { 'start': { 'base': 12, 'offset': 34 }, 'end': None }
12_56       -> { 'start': { 'base': 12, 'offset': 0 },
                   'end': { 'base': 56, 'offset': 0 } }
12+34_56-78 -> { 'start': { 'base': 12, 'offset': 34 },
                   'end': { 'base': 56, 'offset': -78 } }

I've made several stabs using pyparsing. Here's one:

from pyparsing import *
integer = Word(nums)
signed_integer = Word('+-', nums)
underscore = Suppress('_')
position = integer.setResultsName('base') + Or(signed_integer,Empty).setResultsName('offset')
interval = position.setResultsName('start') + Or(underscore + position,Empty).setResultsName('end')

The results are close to what I want:

In [20]: hgvspyparsing.interval.parseString('12-34_56+78').asDict()
Out[20]: 
{'base': '56',
'end': (['56', '+78'], {'base': [('56', 0)], 'offset': [((['+78'], {}), 1)]}),
'offset': (['+78'], {}),
'start': (['12', '-34'], {'base': [('12', 0)], 'offset': [((['-34'], {}), 1)]})}

Two questions:

  1. asDict() only worked on the root parseResult. Is there a way to cajole pyparsing into returning a nested dict (and only that)?

  2. How do I get the optionality of the end of a range and the offset of a position? The Or() in the position rule doesn't cut it. (I tried similarly for the end of the range.) Ideally, I'd treat all positions as special cases of the most complex form (i.e., { start: { base, offset }, end: { base, offset } }), where the simpler cases use 0 or None.

Thanks!


Answer 1:


Some general pyparsing tips:

Or(expr, empty) is better written as Optional(expr). Also, your Or expression was built with the class Empty rather than an instance; you probably meant to write Empty() or empty for the second argument.

expr.setResultsName("name") can now be written as expr("name")

If you want to apply structure to your results, use Group.

Use dump() instead of asDict() to better view the structure of your parsed results.

Here is how I would build up your expression:

from pyparsing import Word, nums, oneOf, Combine, Group, Optional

integer = Word(nums)

sign = oneOf("+ -")
signedInteger = Combine(sign + integer)

integerExpr = Group(integer("base") + Optional(signedInteger, default="0")("offset"))

integerRange = integerExpr("start") + Optional('_' + integerExpr("end"))


tests = """\
12
12+34
12_56
12+34_56-78""".splitlines()

for t in tests:
    result = integerRange.parseString(t)
    print(t)
    print(result.dump())
    print(result.asDict())
    print(result.start.base, result.start.offset)
    if result.end:
        print(result.end.base, result.end.offset)
    print()

Prints:

12
[['12', '0']]
- start: ['12', '0']
  - base: 12
  - offset: 0
{'start': (['12', '0'], {'base': [('12', 0)], 'offset': [('0', 1)]})}
12 0

12+34
[['12', '+34']]
- start: ['12', '+34']
  - base: 12
  - offset: +34
{'start': (['12', '+34'], {'base': [('12', 0)], 'offset': [('+34', 1)]})}
12 +34

12_56
[['12', '0'], '_', ['56', '0']]
- end: ['56', '0']
  - base: 56
  - offset: 0
- start: ['12', '0']
  - base: 12
  - offset: 0
{'start': (['12', '0'], {'base': [('12', 0)], 'offset': [('0', 1)]}), 'end': (['56', '0'], {'base': [('56', 0)], 'offset': [('0', 1)]})}
12 0
56 0

12+34_56-78
[['12', '+34'], '_', ['56', '-78']]
- end: ['56', '-78']
  - base: 56
  - offset: -78
- start: ['12', '+34']
  - base: 12
  - offset: +34
{'start': (['12', '+34'], {'base': [('12', 0)], 'offset': [('+34', 1)]}), 'end': (['56', '-78'], {'base': [('56', 0)], 'offset': [('-78', 1)]})}
12 +34
56 -78
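
To get all the way to the plain nested dict asked for in the question (integer values, 'end' set to None when absent), one option is to convert tokens with parse actions and then build the dict by hand from the named results. This is only a sketch under a few assumptions: the names offset, position, interval and to_dict are made up for illustration, the '_' is suppressed so it does not appear in the results, and parseAll=True is added so trailing garbage is rejected.

```python
from pyparsing import Combine, Group, Optional, Suppress, Word, nums, oneOf

# Convert numeric tokens to int at parse time instead of keeping strings.
integer = Word(nums).setParseAction(lambda toks: int(toks[0]))
offset = Combine(oneOf("+ -") + Word(nums)).setParseAction(lambda toks: int(toks[0]))

# Same grammar shape as above; the offset defaults to the integer 0.
position = Group(integer("base") + Optional(offset, default=0)("offset"))
interval = position("start") + Optional(Suppress("_") + position("end"))


def to_dict(text):
    """Parse a position string into plain nested dicts (hypothetical helper)."""
    result = interval.parseString(text, parseAll=True)

    def pos(group):
        return {"base": group["base"], "offset": group["offset"]}

    # 'end' is only present in the results when the range part matched.
    end = pos(result["end"]) if "end" in result else None
    return {"start": pos(result["start"]), "end": end}


for t in ["12", "12+34", "12_56", "12+34_56-78"]:
    print(t, "->", to_dict(t))
```

Building the dict explicitly avoids depending on what asDict() returns for nested groups, which differs between pyparsing versions.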



Answer 2:


Is the actual syntax more complicated than your examples? Because the parsing can be done fairly easily in pure Python:

bases = ["12", "12+34", "12_56", "12+34", "12+34_56-78"]

def parse_base(base_string):

    def parse_single(s):
        # A single position is a base plus an optional signed offset.
        if '-' in s:
            offset_start = s.find("-")
            base, offset = int(s[:offset_start]), int(s[offset_start:])
        elif '+' in s:
            offset_start = s.find("+")
            base, offset = int(s[:offset_start]), int(s[offset_start:])
        else:
            base = int(s)
            offset = 0
        return {'base': base, 'offset': offset}

    # A range is two positions joined by an underscore.
    range_split = base_string.split('_')
    if len(range_split) == 1:
        start = range_split[0]
        return {'start': parse_single(start), 'end': None}
    elif len(range_split) == 2:
        start, end = range_split
        return {'start': parse_single(start),
                'end': parse_single(end)}
    # More than one underscore is not a valid position.
    raise ValueError("invalid position: %r" % base_string)

Output:

for b in bases:
    print(parse_base(b))

{'start': {'base': 12, 'offset': 0}, 'end': None}
{'start': {'base': 12, 'offset': 34}, 'end': None}
{'start': {'base': 12, 'offset': 0}, 'end': {'base': 56, 'offset': 0}}
{'start': {'base': 12, 'offset': 34}, 'end': None}
{'start': {'base': 12, 'offset': 34}, 'end': {'base': 56, 'offset': -78}}
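
If the syntax really is this simple, a single regular expression is another pure-Python route, and it rejects malformed input in one place. The following is a minimal sketch, not part of the answer above; the names _POSITION, _INTERVAL and parse_interval are hypothetical, and it assumes bases are unsigned integers with at most one signed offset each.

```python
import re

# One position: digits, then an optional signed offset such as +34 or -78.
_POSITION = r"(\d+)([+-]\d+)?"
# An interval: one position, optionally followed by '_' and a second position.
_INTERVAL = re.compile(r"{0}(?:_{0})?".format(_POSITION))


def parse_interval(text):
    """Return {'start': {...}, 'end': {...} or None} for a position string."""
    match = _INTERVAL.fullmatch(text)
    if match is None:
        raise ValueError("not a valid position: {!r}".format(text))
    start_base, start_offset, end_base, end_offset = match.groups()

    def position(base, offset):
        # A missing offset group is None; treat it as 0.
        return {"base": int(base), "offset": int(offset) if offset else 0}

    end = position(end_base, end_offset) if end_base is not None else None
    return {"start": position(start_base, start_offset), "end": end}


for s in ["12", "12+34", "12_56", "12+34_56-78"]:
    print(parse_interval(s))
```

Because fullmatch must consume the whole string, inputs like "12_" or "1_2_3" raise ValueError instead of being partially parsed.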


Source: https://stackoverflow.com/questions/19310282/parsing-nested-structures-with-pyparsing
