Question
I'm parsing an XML-like file with ElementTree in Python and writing the content to a pandas dataframe.
I'm currently facing the following problem: which child tags exist varies from tag to tag. That alone wouldn't be a problem with the solution mentioned here. The complicated part is that some tags have duplicated child tags while others don't. For example, the first product tag has two (different) article numbers and two identical product_types (a duplicate), while the second has only one of each.
<main>
    <product>
        <article_nr>B00024J7C6</article_nr>
        <article_nr>44253</article_nr>
        <product_type>x</product_type>
        <product_type>x</product_type>
    </product>
    <product>
        <article_nr>B00024J7C7</article_nr>
        <product_type>y</product_type>
    </product>
</main>
What I'd like to do is: 1.) remove the duplicates for 'product_type' and 2.) set the value to NULL if there is no second article_nr, otherwise take its value.
My code so far:
def create_dataframe(data):
    df = pd.DataFrame(columns=('article_nr', 'article_nr2', 'product_type', 'product_type2','product_type2'))
    for i in range(len(data)):
        obj = data.getchildren()[i].getchildren()
        row = dict(itertools.izip(['article_nr', 'article_nr2', 'product_type', 'product_type2','product_type2'],
                                  [obj[0].text, obj[1].text, obj[2].text, obj[3].text, obj[4].text]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)
    return df
This works fine with the first example, but obviously not with the second, because there are no values for the second 'article_nr' and 'product_type'.
Output should be:
article_nr   article_nr2   product_type
B00024J7C6   44253         x
B00024J7C7   NULL          y
Answer 1:
Take a look at "Python remove duplicate elements from xml tree"; maybe it can help you. Something like this:
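import xml.etree.ElementTree as ET

path = 'in.xml'
tree = ET.parse(path)
root = tree.getroot()


def elements_equal(e1, e2):
    # Equal means: same tag, text, tail, attributes and (recursively) children.
    # Note that tail holds the whitespace after the closing tag, so in indented
    # XML the last child of a parent can differ from its siblings here.
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))


for page in root:  # iterate over the top-level elements (here: <product>)
    prev = None  # reset per parent so comparisons never cross into the next one
    elems_to_remove = []
    for elem in page:
        # Only consecutive duplicates are detected, since each element is
        # compared with the previously kept one.
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)
            elems_to_remove.append(elem)
            continue
        prev = elem
    # Remove after the loop to avoid mutating the parent while iterating over it.
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)

tree.write("out.xml")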
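The snippet above only removes the duplicates; it does not yet produce the rows the question asks for. As a rough sketch of the second half (not part of the original answer), the following collects the distinct values per tag for each product and pads missing ones with None. The helper name `product_to_row` and the `layout` mapping (at most two article numbers, one product type) are assumptions made for illustration, and the XML is the sample from the question. Values are compared by stripped text only, so whitespace differences between elements are ignored.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question.
XML = """<main>
  <product>
    <article_nr>B00024J7C6</article_nr>
    <article_nr>44253</article_nr>
    <product_type>x</product_type>
    <product_type>x</product_type>
  </product>
  <product>
    <article_nr>B00024J7C7</article_nr>
    <product_type>y</product_type>
  </product>
</main>"""


def product_to_row(product, max_per_tag):
    """Hypothetical helper: distinct child values per tag, padded with None."""
    values = {}
    for child in product:
        text = (child.text or "").strip()
        seen = values.setdefault(child.tag, [])
        if text not in seen:  # skip repeated values for the same tag
            seen.append(text)
    row = {}
    for tag, limit in max_per_tag.items():
        found = values.get(tag, [])
        for i in range(limit):
            # The first column keeps the tag name, later ones get a numeric suffix.
            column = tag if i == 0 else "%s%d" % (tag, i + 1)
            row[column] = found[i] if i < len(found) else None
    return row


root = ET.fromstring(XML)
# Assumed layout: up to two article numbers and one product type per product.
layout = {"article_nr": 2, "product_type": 1}
rows = [product_to_row(p, layout) for p in root]
for row in rows:
    print(row)
```

If pandas is available, passing the list to `pd.DataFrame(rows)` should give one row per product, with the missing second article number shown as a null value.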
Source: https://stackoverflow.com/questions/37089533/check-and-remove-duplicated-children-tags-in-xml