How to prevent lxml remove method from removing text between two elements

Question

I'm using lxml and Python 2.7 to parse XML files. At one point I need to remove an element with the remove method, but strangely it also removes some of the text after it.

The input XML is:

<ce:para view="all">Web and grid services <ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para>

I then need to expand the cross-refs element into multiple cross-ref elements, one per refid. The output should look something like this:

<ce:para view="all">Web and grid services <ce:cross-ref refid="BIB10">[10]</ce:cross-ref><ce:cross-ref refid="BIB11">[11]</ce:cross-ref>, where they can provide rich service descriptions that can help in locating suitable services.</ce:para>

Here is the Python code, with some parts abbreviated:

xpath = "//ce:cross-refs"
cross_refs = tree.xpath(xpath, namespaces={'ce': 'http://www.elsevier.com/xml/common/dtd'})
for c in cross_refs:
    c_parent = c.getparent()
    c_values = c.text.strip("[]")
    ...
    ref_ids = c.attrib['refid'].strip().split()
    i = 0
    for r in ref_ids:
        ...
        tag = et.QName(CE, 'cross-ref')
        exploded_cross_refs = et.Element(tag, refid=r, nsmap=NS_MAP)
        exploded_cross_refs.text = "[" + c_values[i] + "]"
        c.addprevious(exploded_cross_refs)
        i += 1
    c_parent.remove(c)

The code finds each cross-refs element, splits its refid values and its text values, creates new cross-ref elements, and adds them before the original cross-refs element. Finally, it removes the old cross-refs element, and this is exactly where my problem is: when I remove this element, the text between its closing tag and the next element gets removed as well, so the final result looks like this:

<ce:para view="all">Web and grid services <ce:cross-ref refid="BIB10">[10]</ce:cross-ref><ce:cross-ref refid="BIB11">[11]</ce:cross-ref></ce:para>

Notice that the text between the last cross-ref and the closing para tag has been removed! How can I fix this issue?

Answer 1:

As an alternative to strip_elements, especially when not all elements of a given name under a given parent need to be removed, we can write a small helper. Before the element is actually removed, it appends the element's tail to the tail of the previous sibling, if there is one, or to the parent's text otherwise:

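def remove_preserve_tail(element):
    # Look up the parent first so it is defined even when there is no tail
    parent = element.getparent()
    if element.tail:
        prev = element.getprevious()
        if prev is not None:
            # Append the tail to the previous sibling's tail
            prev.tail = (prev.tail or '') + element.tail
        else:
            # No previous sibling: the tail becomes part of the parent's text
            parent.text = (parent.text or '') + element.tail
    parent.remove(element)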

Demo:

>>> from lxml import etree
>>> raw = '''<root>
... foo
... <div></div>has tail and no prev
... <br/><div></div>has tail and prev
... <br/>
... <div>no tail, whitespaces only</div>
... </root>'''
... 
>>> root = etree.fromstring(raw)
>>> divs = root.xpath("//div")
>>> for div in divs:
...     remove_preserve_tail(div)
... 
>>> print etree.tostring(root)
<root>
foo
has tail and no prev
<br/>has tail and prev
<br/>

</root>
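
Applied to the XML from the question, the same idea might look like the sketch below. The namespace URI and element names come from the question. The function name expand_cross_refs and the split(",") step are assumptions made for illustration, since the original code elides that part. The helper is a variant that looks up the parent before checking the tail, so it also works for elements without a tail.

```python
from lxml import etree

CE = "http://www.elsevier.com/xml/common/dtd"  # namespace URI from the question
NS_MAP = {"ce": CE}

def remove_preserve_tail(element):
    """Remove element from its parent but keep the text that follows it."""
    parent = element.getparent()
    if element.tail:
        prev = element.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or "") + element.tail
        else:
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)

def expand_cross_refs(root):
    """Hypothetical helper: split each ce:cross-refs into single ce:cross-ref elements."""
    for c in root.xpath("//ce:cross-refs", namespaces=NS_MAP):
        # Assumption: labels are comma-separated inside the brackets, e.g. "[10,11]"
        labels = c.text.strip("[]").split(",")
        ref_ids = c.attrib["refid"].split()
        for ref_id, label in zip(ref_ids, labels):
            new = etree.Element(etree.QName(CE, "cross-ref"), refid=ref_id, nsmap=NS_MAP)
            new.text = "[" + label.strip() + "]"
            c.addprevious(new)
        # Hand the tail over to the last new cross-ref before removing the old element
        remove_preserve_tail(c)

xml = (
    '<ce:para xmlns:ce="http://www.elsevier.com/xml/common/dtd" view="all">'
    'Web and grid services <ce:cross-refs refid="BIB10 BIB11">[10,11]</ce:cross-refs>'
    ', where they can provide rich service descriptions.</ce:para>'
)
para = etree.fromstring(xml)
expand_cross_refs(para)
result = etree.tostring(para, encoding="unicode")
print(result)
```

With this approach the text ", where they can provide ..." ends up as the tail of the last new cross-ref element instead of disappearing with the removed cross-refs element.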

Answer 2:

It seems that the remove method also removes element.tail by default. So I replaced remove with the strip_elements method, which takes a with_tail argument that lets you control whether the tail is removed. See the lxml documentation for strip_elements; here is the command I used:

et.strip_elements(c_parent, 'cross-refs', with_tail=False)
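
The reason behind this is that lxml stores the text following an element's closing tag on that element, as its tail. The sketch below compares the two calls on a shortened, made-up version of the question's XML. It passes the namespace-qualified tag name to strip_elements, which is an assumption about what a namespaced document needs; the command above uses the bare name.

```python
from lxml import etree

CE = "http://www.elsevier.com/xml/common/dtd"  # namespace URI from the question

def make_para():
    # Shortened stand-in for the paragraph in the question
    return etree.fromstring(
        '<ce:para xmlns:ce="%s">before <ce:cross-refs>[10,11]</ce:cross-refs>'
        ', after</ce:para>' % CE
    )

# Plain remove(): the tail belongs to the removed element and goes with it.
para = make_para()
refs = para[0]
tail_text = refs.tail
para.remove(refs)
after_remove = etree.tostring(para, encoding="unicode")

# strip_elements(..., with_tail=False): the element goes, its tail text stays.
para = make_para()
etree.strip_elements(para, "{%s}cross-refs" % CE, with_tail=False)
after_strip = etree.tostring(para, encoding="unicode")

print(after_remove)
print(after_strip)
```

Note that strip_elements deletes every matching element in the subtree it is given, so it fits best when all cross-refs elements under the parent should go.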

Source: https://stackoverflow.com/questions/37046511/how-to-prevent-lxml-remove-method-from-removing-text-between-two-elements

Tags

python

xml

python-2.7

lxml