Question
I am struggling to come up with a simple solution that iterates over XML data and removes the next element if it is a duplicate of the current one.
Example:
From this "input":
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data2</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data5</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data6</c>
</b>
</root>
I would like to get to this "output":
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
</root>
To do this, I came up with the following code:
from lxml import etree
from io import StringIO
xml = '''
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data2</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data5</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data6</c>
</b>
</root>'''
# this is to simulate that the above xml was read from a file
# (unicode() is Python 2 only; on Python 3, pass the string directly: StringIO(xml))
file = StringIO(unicode(xml))
# reading the xml from a file
tree = etree.parse(file)
root = tree.getroot()
# iterate over all "b" elements
for element in root.iter('b'):
    # checks if the last "b" element has been reached.
    # on the last element, getnext() returns None, so ".attrib" raises an
    # "AttributeError" exception, which terminates the for loop
    try:
        # attributes of actual element
        elem_attrib_ACT = element.attrib
        # attributes of next element
        elem_attrib_NEXT = element.getnext().attrib
    except AttributeError:
        # if no other element, break
        break
    print('attributes of ACTUAL elem:', elem_attrib_ACT, 'attributes of NEXT elem:', elem_attrib_NEXT)
    if elem_attrib_ACT == elem_attrib_NEXT:
        print('next elem is duplicate of actual one -> remove it')
        # I would like to remove next element but this approach is not working
        # if you uncomment, it removes the elements of "data2" but stops
        # how to remove the next duplicate element?
        #element.getparent().remove(element.getnext())
    else:
        print('next elem is not a duplicate of actual')
print('result:')
print(etree.tostring(root))
Uncommenting the line
#element.getparent().remove(element.getnext())
removes the element containing "data2", but then the loop stops. The resulting XML is:
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data5</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data6</c>
</b>
</root>
My impression is that I am "cutting the branch on which I am sitting"...
Any suggestions on how to solve this?
Answer 1:
I think your suspicion is correct. If you put a print statement before the break in the except block, you can see that the loop breaks early because this element has been removed (I think):
<b attrib1="abc" attrib2="def">
<c>data2</c>
</b>
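Try using getprevious() instead of getnext(), and remove the current element rather than the next one. I also switched to a list comprehension, so the loop iterates over a fixed list of elements, and sliced off the first element, which has no previous sibling and would of course raise an exception at .getprevious().attrib: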
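for element in [e for e in root.iter('b')][1:]:
    try:
        # compare with the previous sibling still in the tree
        if element.getprevious().attrib == element.attrib:
            element.getparent().remove(element)
    except AttributeError:
        # getprevious() returned None: there is no previous sibling
        print('except ')
print(etree.tostring(root))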
results:
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
</root>
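The same idea can be written without getprevious() or exception handling by remembering the last element that was kept. The following is only a minimal sketch, not the answerer's code: the function name drop_consecutive_duplicates is made up for illustration, it only looks at the direct children of one parent, and it treats two neighbours as duplicates when their tag and attributes match. It uses only calls that lxml shares with the standard library's xml.etree.ElementTree, so the fallback import is an assumption added for convenience.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the stdlib module is enough here, because only calls
    # common to both APIs (fromstring, remove, attrib, find) are used.
    import xml.etree.ElementTree as etree

XML = """<root>
<b attrib1="abc" attrib2="def"><c>data1</c></b>
<b attrib1="abc" attrib2="def"><c>data2</c></b>
<b attrib1="uvw" attrib2="xyz"><c>data3</c></b>
<b attrib1="abc" attrib2="def"><c>data4</c></b>
<b attrib1="abc" attrib2="def"><c>data5</c></b>
<b attrib1="abc" attrib2="def"><c>data6</c></b>
</root>"""


def drop_consecutive_duplicates(parent):
    """Remove every child whose tag and attributes equal those of the
    most recently kept sibling. Returns the number of removed children.
    (Hypothetical helper, named for this example only.)
    """
    removed = 0
    previous_key = None
    # list(parent) takes a snapshot of the children, so removing
    # elements inside the loop does not disturb the iteration.
    for child in list(parent):
        key = (child.tag, dict(child.attrib))
        if key == previous_key:
            parent.remove(child)
            removed += 1
        else:
            previous_key = key
    return removed


root = etree.fromstring(XML)
removed = drop_consecutive_duplicates(root)
remaining = [b.find('c').text for b in root]
print(remaining)  # ['data1', 'data3', 'data4']
```

Iterating over a snapshot is what avoids "cutting the branch": the loop never depends on the position of an element that has just been removed.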
Source: https://stackoverflow.com/questions/32097440/how-to-iterate-through-xml-data-to-remove-next-duplicate-element-using-lxml