pyparsing whitespace match issues

Posted by 北战南征 on 2019-12-04 07:34:26

I always flinch when whitespace creeps into parsed tokens, but given your constraint that only single spaces are allowed inside a value, this should be workable. I used the following expression to define values that may contain embedded single spaces:

from pyparsing import Combine, OneOrMore, Word, White, printables

# each value consists of printable words separated by at most a
# single space (a space that is not followed by another space)
value = Combine(OneOrMore(Word(printables) | White(' ', max=1) + ~White()))

With this done, a line is just one or more of these values:

linedefn = OneOrMore(value)

Following your example, including calling str.replace to replace tabs with pairs of spaces, the code looks like:

data = "Library\tSSHClient    with name\tnode"

# replace tabs with 2 spaces
data = data.replace('\t', '  ')

print(linedefn.parseString(data))

Giving:

['Library', 'SSHClient', 'with name', 'node']

To get the start and end locations of each value in the original string, wrap the expression in the pyparsing helper locatedExpr (new at the time of writing; pyparsing 3 deprecates it in favor of the Located class):

# use new locatedExpr to get the value, start, and end location 
# for each value
linedefn = OneOrMore(locatedExpr(value))('values')

If we parse and dump the results:

print(linedefn.parseString(data).dump())

We get:

- values: 
  [0]:
    [0, 'Library', 7]
    - locn_end: 7
    - locn_start: 0
    - value: Library
  [1]:
    [9, 'SSHClient', 18]
    - locn_end: 18
    - locn_start: 9
    - value: SSHClient
  [2]:
    [22, 'with name', 31]
    - locn_end: 31
    - locn_start: 22
    - value: with name
  [3]:
    [33, 'node', 37]
    - locn_end: 37
    - locn_start: 33
    - value: node
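
As a cross-check on those locations, the same rule (a value is a run of non-whitespace words joined by single spaces) can be sketched with the standard-library re module instead of pyparsing. This is only an illustration under that assumption, not part of the pyparsing solution; the names VALUE_RE and locate_values are made up for this sketch.

```python
import re

# Hypothetical stand-in for the pyparsing `value` expression: runs of
# non-whitespace characters joined by exactly one space.
VALUE_RE = re.compile(r"\S+(?: \S+)*")


def locate_values(line):
    """Return a (start, text, end) tuple for each value found in line."""
    return [(m.start(), m.group(), m.end()) for m in VALUE_RE.finditer(line)]


# same preprocessing as above: each tab becomes two spaces
data = "Library\tSSHClient    with name\tnode".replace("\t", "  ")

for start, text, end in locate_values(data):
    print(start, repr(text), end)
```

The tuples it prints should agree with the locn_start and locn_end values in the dump above.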

LineStart and LineEnd are pyparsing expression classes whose instances should match at the start and end of a line. LineStart has always been difficult to work with, but LineEnd is fairly predictable. In your case, if you just read and parse a line at a time, then you shouldn't need them - just define the contents of the line that you expect. If you want to ensure that the parser has processed the entire string (and not stopped short of the end because of a non-matching character), add + LineEnd() or + StringEnd() to the end of your parser, or add the argument parseAll=True to your call to parseString().

EDIT:

It is easy to forget that pyparsing calls str.expandtabs by default - you have to disable this by calling parseWithTabs. That, together with explicitly disallowing TABs between value words, resolves your problem and keeps the values at the correct character counts. See changes below:

from pyparsing import Combine, OneOrMore, White, Word, locatedExpr, printables
TAB = White('\t')

# each value consists of printable words separated by at most a 
# single space (a space that is not followed by another space)
value = Combine(OneOrMore(~TAB + (Word(printables) | White(' ',max=1) + ~White())))

# each line has one or more of these values
linedefn = OneOrMore(value)
# do not expand tabs before parsing
linedefn.parseWithTabs()


data = "Library\tSSHClient    with name\tnode"

# tabs are no longer replaced with 2 spaces; they are parsed as-is
#data = data.replace('\t', '  ')

print(linedefn.parseString(data))


linedefn = OneOrMore(locatedExpr(value))('values')
# do not expand tabs before parsing
linedefn.parseWithTabs()
print(linedefn.parseString(data).dump())
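
The effect of the default tab expansion can be seen with plain str.expandtabs, without involving pyparsing at all. The sketch below assumes the default tab size of 8 and reuses the sample string from above; it is meant only to show why the character counts drift when tabs are expanded.

```python
data = "Library\tSSHClient    with name\tnode"

# str.expandtabs() pads each tab out to the next multiple of 8 columns
expanded = data.expandtabs()
print(repr(expanded))

# The tab after "Library" (7 characters) becomes a single space, so a
# single-space rule would glue "Library" and "SSHClient" into one value.
print(expanded[:17] == "Library SSHClient")

# Later offsets shift as well: compare where "node" starts in the
# original string and in the expanded one.
print(data.index("node"), expanded.index("node"))
```

This is why the revised parser both keeps the tabs (parseWithTabs) and refuses to match a TAB inside a value.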