How to iterate through XML data to remove the next duplicate element using lxml

Question

I am struggling to come up with a simple solution that iterates over XML data and removes the next element if it is a duplicate of the current one (that is, if it has the same attributes).

Example:

From this "input":

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

I would like to get to this "output":

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
</root>

To do this, I came up with the following code:

from lxml import etree
from io import StringIO


xml = '''
<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>'''

# this is to simulate that above xml was read from a file
# (Python 3; on Python 2 use StringIO(unicode(xml)) instead)
file = StringIO(xml)

# reading the xml from a file
tree = etree.parse(file)
root = tree.getroot()

# iterate over all "b" elements
for element in root.iter('b'):
    # checks if the last "b" element has been reached.
    # on the last element, getnext() returns None, so accessing ".attrib"
    # raises an "AttributeError" exception, which terminates the for loop
    try:
        # attributes of actual element
        elem_attrib_ACT = element.attrib
        # attributes of next element
        elem_attrib_NEXT = element.getnext().attrib
    except AttributeError:
        # if no other element, break
        break
    print('attributes of ACTUAL elem:', elem_attrib_ACT, 'attributes of NEXT elem:', elem_attrib_NEXT)
    if elem_attrib_ACT == elem_attrib_NEXT:
        print('next elem is duplicate of actual one -> remove it')
        # I would like to remove next element but this approach is not working
        # if you uncomment, it removes the elements of "data2" but stops
        # how to remove the next duplicate element?
        #element.getparent().remove(element.getnext())
    else:
        print('next elem is not a duplicate of actual')

print('result:')
print(etree.tostring(root))

Uncommenting the line

#element.getparent().remove(element.getnext())

removes the element containing "data2", but then the loop stops. The resulting XML is this one:

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

My impression is that I am "cutting the branch I am sitting on"...

Any suggestions on how to solve this one?

Answer 1:

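I think your suspicion is correct. If you put a print statement before the break in the except block, you can see that the loop ends early because the iterator has moved on to the element that was just removed (I think). That element is no longer attached to the tree, so it has no next sibling: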

<b attrib1="abc" attrib2="def">
    <c>data2</c>
</b>

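Try using getprevious() instead of getnext(). I also copy the elements into a list before looping, so that removing one does not disturb the iteration, and skip the first element, since it has no previous sibling (getprevious() returns None for it):

# list(...) takes a snapshot first, so removing elements does not disturb the loop;
# [1:] skips the first "b", which has no previous sibling
for element in list(root.iter('b'))[1:]:
    previous = element.getprevious()
    # remove the element if it repeats the attributes of the sibling before it
    if previous is not None and previous.attrib == element.attrib:
        element.getparent().remove(element)
print(etree.tostring(root).decode())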

results:

<root>
<b attrib1="abc" attrib2="def">
    <c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
    <c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
    <c>data4</c>
</b>
</root>
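
The same idea can also be written without getprevious(), by remembering the last element that was kept. The following is only a minimal sketch under a few assumptions: the helper name remove_consecutive_duplicates is made up for illustration, the duplicates are direct children of a single parent, and "duplicate" means "same tag and same attributes as the sibling kept just before it". It uses only calls that lxml shares with the standard library's ElementTree, so it falls back to the latter if lxml is not installed.

```python
try:
    from lxml import etree  # preferred, matches the question
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree


def remove_consecutive_duplicates(parent, tag):
    """Hypothetical helper: drop each `tag` child whose attributes equal
    those of the sibling that was kept just before it."""
    previous = None
    # list(parent) is a snapshot, so removing children does not disturb the loop
    for child in list(parent):
        is_duplicate = (
            previous is not None
            and child.tag == tag
            and previous.tag == tag
            and dict(child.attrib) == dict(previous.attrib)
        )
        if is_duplicate:
            parent.remove(child)
        else:
            # only elements that stay in the tree become the new reference
            previous = child
    return parent


sample = (
    '<root>'
    '<b attrib1="abc" attrib2="def"><c>data1</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data2</c></b>'
    '<b attrib1="uvw" attrib2="xyz"><c>data3</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data4</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data5</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data6</c></b>'
    '</root>'
)
root = etree.fromstring(sample)
remove_consecutive_duplicates(root, 'b')
remaining = [c.text for c in root.iter('c')]
print(remaining)  # ['data1', 'data3', 'data4']
```

Tracking the last kept element instead of calling getprevious() means the comparison never depends on the state of the tree after a removal, which may be easier to reason about when runs of duplicates are long.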

Source: https://stackoverflow.com/questions/32097440/how-to-iterate-through-xml-data-to-remove-next-duplicate-element-using-lxml

Tags

python

xml

lxml