Question
I'm using lxml and Python 2.7 to parse XML files. At some point I need to remove an element with the remove method, but strangely it also removes some of the text that follows the element.
The input XML is:
<ce:para view="all">Web and grid services <ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para>
Then I need to expand the cross-refs element into multiple cross-ref elements, each with its own refid. So the output should look something like this:
<ce:para view="all">Web and grid services <ce:cross-ref refid="BIB10">[10]</ce:cross-ref><ce:cross-ref refid="BIB11">[11]</ce:cross-ref>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para>
And here is the Python code, with some parts abbreviated:
    xpath = "//ce:cross-refs"
    cross_refs = tree.xpath(xpath, namespaces={'ce': 'http://www.elsevier.com/xml/common/dtd'})
    for c in cross_refs:
        c_parent = c.getparent()
        c_values = c.text.strip("[]")
        ...
        ref_ids = c.attrib['refid'].strip().split()
        i = 0
        for r in ref_ids:
            ...
            tag = et.QName(CE, 'cross-ref')
            exploded_cross_refs = et.Element(tag, refid=r, nsmap=NS_MAP)
            exploded_cross_refs.text = "[" + c_values[i] + "]"
            c.addprevious(exploded_cross_refs)
            i += 1
        c_parent.remove(c)
This gets each cross-refs element, splits its refid values and its text values, creates new cross-ref elements, and adds them before the original cross-refs element. Finally, I want to remove the old cross-refs element, and my problem is exactly here: when I remove this element, the text between its closing tag and the next element gets removed as well, so the final result looks like this:
<ce:para view="all">Web and grid services <ce:cross-ref refid="BIB10">[10]</ce:cross-ref><ce:cross-ref refid="BIB11">[11]</ce:cross-ref></ce:para>
Notice that the text between the last cross-ref element and the closing para tag has been removed! How can I fix this issue?
Answer 1:
Alternatively, especially when not all elements of a certain name within a certain parent need to be removed, we can write a simple function that appends the element's tail to the tail of the previous sibling, if there is one, or to the parent's text otherwise, before the element is actually removed:
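    def remove_preserve_tail(element):
        # Look up the parent first so it is available even when there is no tail.
        parent = element.getparent()
        if element.tail:
            prev = element.getprevious()
            if prev is not None:
                # Hand the tail over to the previous sibling.
                prev.tail = (prev.tail or '') + element.tail
            else:
                # No previous sibling: the tail becomes part of the parent's text.
                parent.text = (parent.text or '') + element.tail
        parent.remove(element)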
Demo:
>>> from lxml import etree
>>> raw = '''<root>
... foo
... <div></div>has tail and no prev
... <br/><div></div>has tail and prev
... <br/>
... <div>no tail, whitespaces only</div>
... </root>'''
...
>>> root = etree.fromstring(raw)
>>> divs = root.xpath("//div")
>>> for div in divs:
...     remove_preserve_tail(div)
...
>>> print etree.tostring(root)
<root>
foo
has tail and no prev
<br/>has tail and prev
<br/>
</root>
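Applied to the input from the question, the same idea might look like the sketch below. It is only an illustration under a few assumptions: the sample text is shortened, the NS_MAP prefix mapping is guessed from the question's snippet, and the label is assumed to be a comma-separated list in the same order as the refid values. The helper is repeated here, with the parent looked up before the tail check, so that the snippet runs on its own.

```python
from lxml import etree

CE = 'http://www.elsevier.com/xml/common/dtd'  # namespace URI taken from the question
NS_MAP = {'ce': CE}                            # assumed prefix mapping


def remove_preserve_tail(element):
    """Remove element from its parent but keep its tail text in the tree."""
    parent = element.getparent()
    if element.tail:
        prev = element.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or '') + element.tail
        else:
            parent.text = (parent.text or '') + element.tail
    parent.remove(element)


# Shortened version of the question's input (the sentence is abbreviated).
raw = ('<ce:para xmlns:ce="%s" view="all">Web and grid services '
       '<ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>'
       ', where they can provide rich service descriptions.</ce:para>' % CE)
para = etree.fromstring(raw)

for c in para.xpath('//ce:cross-refs', namespaces=NS_MAP):
    ref_ids = c.attrib['refid'].split()
    # Assumption: "[10,11]" lists one label per refid, in the same order.
    labels = c.text.strip('[]').split(',')
    for ref_id, label in zip(ref_ids, labels):
        new_ref = etree.Element(etree.QName(CE, 'cross-ref'),
                                refid=ref_id, nsmap=NS_MAP)
        new_ref.text = '[' + label.strip() + ']'
        c.addprevious(new_ref)
    # The tail of the old element moves to the last new cross-ref before removal.
    remove_preserve_tail(c)

result = etree.tostring(para).decode()
print(result)
```

The printed paragraph should contain the two new cross-ref elements followed by the ", where they can provide ..." text that a plain remove call would have dropped.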
Answer 2:
Well, it seems the remove method removes element.tail along with the element by default. So I replaced remove with the strip_elements method, which takes a with_tail argument that lets you control whether the tail is removed. It is covered in the lxml documentation; here is the command I used:
et.strip_elements(c_parent, 'cross-refs', with_tail=False)
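For a self-contained check of this approach, here is a minimal sketch on a shortened version of the question's input. Because the cross-refs element is namespaced, the sketch passes the fully qualified {namespace}tag form of the name. That choice is an assumption about what this input needs, not something the answer states.

```python
from lxml import etree

CE = 'http://www.elsevier.com/xml/common/dtd'  # namespace URI taken from the question

# Shortened version of the question's input (the sentence is abbreviated).
raw = ('<ce:para xmlns:ce="%s">Web and grid services '
       '<ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>'
       ', where they help.</ce:para>' % CE)
para = etree.fromstring(raw)

# with_tail=False removes the element and its content
# but leaves the text that follows it in the tree.
etree.strip_elements(para, '{%s}cross-refs' % CE, with_tail=False)

result = etree.tostring(para).decode()
print(result)
```

Note that strip_elements removes every matching element below the given node, so it fits best when all cross-refs elements under that parent should go.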
Source: https://stackoverflow.com/questions/37046511/how-to-prevent-lxml-remove-method-from-removing-text-between-two-elements