Why doesn't getparent() work as expected?

Submitted by 别等时光非礼了梦想 on 2019-12-10 15:55:05

Question


I need to manipulate the text inside one of the tags, and for every text node found I want to get its parent tag.

Code:

import lxml.etree
import pprint
s = '''
<data>
    data text
    <foo>foo - <bar>bar</bar> text</foo>
    data text
    <bar>
        bar text
        <baz>baz text</baz>
        <baz>baz text</baz>
        bar text
    </bar>
    data text
</data>
'''
etree = lxml.etree.fromstring(s)
text = etree.xpath("//text()[normalize-space()]")
pprint.pprint([(s.getparent().tag, s.strip()) for s in text])

Output:

[('data', 'data text'),
 ('foo', 'foo -'),
 ('bar', 'bar'),
 ('bar', 'text'),
 ('foo', 'data text'),
 ('bar', 'bar text'),
 ('baz', 'baz text'),
 ('baz', 'baz text'),
 ('baz', 'bar text'),
 ('bar', 'data text')]

I expected:

[('data', 'data text'),
 ('foo', 'foo -'),
 ('bar', 'bar'),
 ('foo', 'text'),
 ('data', 'data text'),
 ('bar', 'bar text'),
 ('baz', 'baz text'),
 ('baz', 'baz text'),
 ('bar', 'bar text'),
 ('data', 'data text')]

Where is my mistake? It looks like the tags in my output are not the parent tags of the text in the tree, but simply the preceding tags.

Edit: Working code for my needs:

etree = lxml.etree.fromstring(s)
text = etree.xpath("//text()[normalize-space()]")
for s in text:
    if s.is_tail:
        print(s.getparent().getparent().tag, s.strip())
    else:
        print(s.getparent().tag, s.strip())

Answer 1:


What you are seeing has to do with the tail property (text immediately following an end tag), which is a peculiarity of the ElementTree and lxml way of representing XML. For tail text, getparent() returns the element that the text is the tail of, not the element that encloses the text.

By adding an is_tail test (is_tail is True if the text is "tail text") to your code, you can see what's happening:

import lxml.etree
import pprint

s = '''
<data>
    data text
    <foo>foo - <bar>bar</bar> text</foo>
    data text
    <bar>
        bar text
        <baz>baz text</baz>
        <baz>baz text</baz>
        bar text
    </bar>
    data text
</data>
'''

etree = lxml.etree.fromstring(s)
text = etree.xpath("//text()[normalize-space()]")
pprint.pprint([(s.getparent().tag, s.is_tail, s.strip()) for s in text])

Output:

[('data', False, 'data text'),
 ('foo', False, 'foo -'),
 ('bar', False, 'bar'),
 ('bar', True, 'text'),
 ('foo', True, 'data text'),
 ('bar', False, 'bar text'),
 ('baz', False, 'baz text'),
 ('baz', False, 'baz text'),
 ('baz', True, 'bar text'),
 ('bar', True, 'data text')] 
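
To make the tail property more concrete, here is a minimal sketch that inspects .text and .tail directly on a small fragment taken from the document above. The variable names root and bar are only illustrative.

```python
import lxml.etree

# A fragment of the question's document: "foo - " precedes <bar>, " text" follows </bar>.
root = lxml.etree.fromstring("<foo>foo - <bar>bar</bar> text</foo>")
bar = root[0]  # the first (and only) child element of <foo>

print(repr(root.text))  # text before the first child: 'foo - '
print(repr(bar.text))   # text inside <bar>: 'bar'
print(repr(bar.tail))   # text right after </bar>, stored on bar: ' text'
print(repr(root.tail))  # nothing follows </foo>: None
```

The string " text" sits inside <foo> in the document, but the tree stores it on <bar> as its tail, which is why getparent() reports 'bar' for it.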



Answer 2:


As far as I can see, this is due to the "tail" concept in lxml (see: 2. How ElementTree represents XML). When an element's content mixes element nodes and text nodes, a text node is represented as a normal child of the parent element only if it comes first; otherwise it is represented as the 'tail' of the preceding element.

For a 'tail' text node (is_tail is True), you can call getparent() twice to get the actual parent, for example:

pprint.pprint(
    [(s.getparent().getparent().tag if s.is_tail else s.getparent().tag,
      s.strip())
     for s in text]
    )

Output:

[('data', 'data text'),
 ('foo', 'foo -'),
 ('bar', 'bar'),
 ('foo', 'text'),
 ('data', 'data text'),
 ('bar', 'bar text'),
 ('baz', 'baz text'),
 ('baz', 'baz text'),
 ('bar', 'bar text'),
 ('data', 'data text')]
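
If this lookup is needed in more than one place, the same logic could be wrapped in a small helper. This is only a sketch: the name real_parent is hypothetical and not part of lxml, and it assumes the strings come from an lxml XPath text() query, so that they carry getparent() and is_tail.

```python
import lxml.etree

def real_parent(text_node):
    """Return the element that encloses text_node in the document.

    text_node is assumed to be a string result of an lxml XPath text() query.
    """
    owner = text_node.getparent()
    # For tail text, owner is the preceding sibling element, so go up one more level.
    return owner.getparent() if text_node.is_tail else owner

root = lxml.etree.fromstring("<foo>foo - <bar>bar</bar> text</foo>")
pairs = [(real_parent(t).tag, t.strip())
         for t in root.xpath("//text()[normalize-space()]")]
print(pairs)  # [('foo', 'foo -'), ('bar', 'bar'), ('foo', 'text')]
```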


Source: https://stackoverflow.com/questions/31770189/why-getparent-dont-work-as-expected
