Check and remove duplicated children tags in XML

Question

I'm parsing an XML-like file with ElementTree in Python and writing the content to a pandas DataFrame.

I'm currently facing the following problem: which child tags exist varies from one tag to the next. This wouldn't be a problem with the solution mentioned here. However, the complicated part is that some tags have duplicated child tags while others don't. For example, the first product tag has two (different) article numbers and two equal product_types (a duplicate), while the second has only one of each.

<main>
    <product>
       <article_nr>B00024J7C6</article_nr>
       <article_nr>44253</article_nr>
       <product_type>x</product_type>
       <product_type>x</product_type>
    </product>

    <product>
       <article_nr>B00024J7C7</article_nr>
       <product_type>y</product_type>
    </product>
</main>

What I'd like to do is: 1.) remove the duplicates for 'product_type' and 2.) set the value to NULL if there is no second article_nr, and otherwise take its value.

My code so far:

import pandas as pd

def create_dataframe(data):
    columns = ['article_nr', 'article_nr2', 'product_type', 'product_type2']
    rows = []
    for product in data:                  # data is the <main> root element
        obj = list(product)               # children of this <product>
        # Positional access: raises IndexError when a product has fewer than four children
        rows.append(dict(zip(columns, [obj[0].text, obj[1].text, obj[2].text, obj[3].text])))
    return pd.DataFrame(rows, columns=columns)

This works fine with the first example, but obviously not with the second, because there are no values for the second 'article_nr' and 'product_type'.

Output should be:

article_nr    article_nr2   product_type
B00024J7C6    44253         x
B00024J7C7    NULL          y

Answer 1:

Take a look at "Python remove duplicate elements from xml tree"; it may help. Something like this:

import xml.etree.ElementTree as ET
path = 'in.xml'
tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if (e1.tail or '').strip() != (e2.tail or '').strip(): return False  # ignore indentation differences
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:                     # iterate over the <product> elements
    elems_to_remove = []
    prev = None                       # only compare siblings within the same product
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
tree.write("out.xml")
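The snippet above only removes duplicates that sit next to each other, and it does not cover the second requirement from the question (NULL when there is no second article_nr). A minimal sketch of one way to handle both while building the rows is shown below. It assumes the XML from the question is available as a string; the helper name product_to_row and the numeric-suffix naming of repeated tags (article_nr2 and so on) are illustrative choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the question
XML = """<main>
    <product>
       <article_nr>B00024J7C6</article_nr>
       <article_nr>44253</article_nr>
       <product_type>x</product_type>
       <product_type>x</product_type>
    </product>

    <product>
       <article_nr>B00024J7C7</article_nr>
       <product_type>y</product_type>
    </product>
</main>"""


def product_to_row(product):
    """Hypothetical helper: map one <product> to a dict of distinct values."""
    values = {}
    for child in product:
        seen = values.setdefault(child.tag, [])
        text = (child.text or "").strip()
        if text not in seen:              # skip repeated values such as the second 'x'
            seen.append(text)
    row = {}
    for tag, texts in values.items():
        for n, text in enumerate(texts, start=1):
            # The first value keeps the plain tag name; later ones get a numeric suffix
            row[tag if n == 1 else "%s%d" % (tag, n)] = text
    return row


root = ET.fromstring(XML)
rows = [product_to_row(product) for product in root]

# Collect every column name in order of first appearance
columns = []
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

# Fill the gaps with None so that every row has the same keys
rows = [{col: row.get(col) for col in columns} for row in rows]

for row in rows:
    print(row)
```

Passing the result to pandas with pd.DataFrame(rows, columns=columns) should then give one line per product, with the missing second article_nr shown as a missing value. Because duplicates are detected per tag rather than by comparing neighbours, this variant also drops repeated values that are not adjacent.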

Source: https://stackoverflow.com/questions/37089533/check-and-remove-duplicated-children-tags-in-xml

Tags

python

xml

parsing

elementtree