Question
I need to manipulate the text inside one of the tags, and for every text node I find I want to get its parent tag.
Code:
import lxml.etree
import pprint
s = '''
<data>
data text
<foo>foo - <bar>bar</bar> text</foo>
data text
<bar>
bar text
<baz>baz text</baz>
<baz>baz text</baz>
bar text
</bar>
data text
</data>
'''
etree = lxml.etree.fromstring(s)
text = etree.xpath("//text()[normalize-space()]")
pprint.pprint([(s.getparent().tag, s.strip()) for s in text])
Output:
[('data', 'data text'),
('foo', 'foo -'),
('bar', 'bar'),
('bar', 'text'),
('foo', 'data text'),
('bar', 'bar text'),
('baz', 'baz text'),
('baz', 'baz text'),
('baz', 'bar text'),
('bar', 'data text')]
I expected:
[('data', 'data text'),
('foo', 'foo -'),
('bar', 'bar'),
('foo', 'text'),
('data', 'data text'),
('bar', 'bar text'),
('baz', 'baz text'),
('baz', 'baz text'),
('bar', 'bar text'),
('data', 'data text')]
Where is my mistake? It looks like the tags in my output are not the parent tags of the text in the tree, but simply the previous tags.
Edit: Working code for my needs:
etree = lxml.etree.fromstring(s)
text = etree.xpath("//text()[normalize-space()]")
for s in text:
    if s.is_tail:
        # Tail text hangs off the preceding element, so go up one more level.
        print(s.getparent().getparent().tag, s.strip())
    else:
        print(s.getparent().tag, s.strip())
Answer 1:
What you are seeing has to do with the tail property (text immediately following an end tag), which is a peculiarity of the way ElementTree and lxml represent XML.
By adding an is_tail test (it returns True if the text is "tail text") to your code, you can see what's happening:
import lxml.etree
import pprint
s = '''
<data>
data text
<foo>foo - <bar>bar</bar> text</foo>
data text
<bar>
bar text
<baz>baz text</baz>
<baz>baz text</baz>
bar text
</bar>
data text
</data>
'''
etree = lxml.etree.fromstring(s)
text = etree.xpath("//text()[normalize-space()]")
pprint.pprint([(s.getparent().tag, s.is_tail, s.strip()) for s in text])
Output:
[('data', False, 'data text'),
('foo', False, 'foo -'),
('bar', False, 'bar'),
('bar', True, 'text'),
('foo', True, 'data text'),
('bar', False, 'bar text'),
('baz', False, 'baz text'),
('baz', False, 'baz text'),
('baz', True, 'bar text'),
('bar', True, 'data text')]
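The same text/tail split can be seen directly on the elements, without XPath. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree (lxml follows the same data model); the variable names are only illustrative:

```python
import xml.etree.ElementTree as ET

# The <foo> line from the sample document in the question.
foo = ET.fromstring("<foo>foo - <bar>bar</bar> text</foo>")
bar = foo[0]

foo_text = foo.text   # text before the first child: 'foo - '
bar_text = bar.text   # text inside <bar>: 'bar'
bar_tail = bar.tail   # text after </bar>, stored on <bar>: ' text'
foo_tail = foo.tail   # nothing follows </foo> here: None

print(repr(foo_text), repr(bar_text), repr(bar_tail), repr(foo_tail))
```

The string ' text' sits inside <foo> in the document, but it is stored on <bar>, which is why getparent() on that text node reports bar.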
Answer 2:
As far as I can see, this is due to the "tail" concept in lxml (see: 2. How ElementTree represents XML). When the content of an element mixes element nodes and text nodes, a text node is represented as the 'tail' of the element that precedes it. It is represented normally, as the text of the parent element, only if it comes first.
You can call getparent() twice to get the actual parent of a 'tail' text node (is_tail=True), for example:
pprint.pprint(
[(s.getparent().getparent().tag if s.is_tail else s.getparent().tag,
s.strip())
for s in text]
)
Output:
[('data', 'data text'),
('foo', 'foo -'),
('bar', 'bar'),
('foo', 'text'),
('data', 'data text'),
('bar', 'bar text'),
('baz', 'baz text'),
('baz', 'baz text'),
('bar', 'bar text'),
('data', 'data text')]
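If this lookup is needed in more than one place, the conditional can be wrapped in a small function. This is only a sketch: logical_parent is a hypothetical helper name, not part of lxml, and the short input document is made up for illustration.

```python
import lxml.etree


def logical_parent(text_node):
    """Return the element that encloses text_node in the document.

    Hypothetical helper; expects a text result returned by lxml's xpath().
    """
    holder = text_node.getparent()
    # Tail text is attached to the preceding sibling element, so the
    # enclosing element is one level further up.
    return holder.getparent() if text_node.is_tail else holder


root = lxml.etree.fromstring("<data>a<foo>b<bar>c</bar>d</foo>e</data>")
pairs = [(logical_parent(t).tag, str(t)) for t in root.xpath("//text()")]
print(pairs)
```

Here 'd' is the tail of bar and 'e' is the tail of foo, so both need the extra getparent() step to reach foo and data respectively.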
Source: https://stackoverflow.com/questions/31770189/why-getparent-dont-work-as-expected