How to properly parse parent/child XML with Python

Question

I have an XML parsing issue that I have been working on for the last few days and can't figure out. I've used both the ElementTree module built into Python and the lxml library, with the same results. I would like to keep using ElementTree if I can, but if that library has limitations, lxml would do. Please see the following XML example. I am trying to find each connection element and see which classes it contains. I expect each connection to contain at least one class, and if one doesn't, I want to know. The problem is that my code returns ALL THE CLASSES in the document for each connection, instead of only the classes for that specific connection.

<test>
  <connections>
    <connection>
      <id>10</id>
      <classes>
        <class>
          <classname>DVD</classname>
        </class>
        <class>
          <classname>DVD_TEST</classname>
        </class>
      </classes>
    </connection>
    <connection>
      <id>20</id>
      <classes>
        <class>
          <classname>TV</classname>
        </class>
      </classes>
    </connection>
  </connections>
</test>

For example, here is my Python code and the output that it returns:

for parentConnection in elemetTree.getiterator('connection'):
    # print parentConnection.tag
    for childConnection in parentConnection:
        # print childConnection.text
        if childConnection.tag == 'id':
            connID = childConnection.text
            print connID
    for p in tree.xpath('./connections/connection/classes/class'):
        for attrib in p.attrib:
            print '@' + attrib + '=' + p.attrib[attrib]

        children = p.getchildren()
        for child in children:
            print child.text

Here is the output:

10
DVD
DVD_TEST
TV

20
DVD
DVD_TEST
TV

As you can see, I am printing the text of the CONNECTION ID and then the text of each CLASSNAME. However, both connections show the same CLASSNAME text. The output should really look like this:

10
DVD
DVD_TEST

20
TV

As the hand-modified example above shows, each connection ID (parent) should list only its own classes/classnames (children). I just can't figure out how to make this work. If any of you know how, I would love to hear it.

I've tried building a data structure and other examples from this forum, but I just can't get it to work right.

Answer 1:

Here is my solution without using xpath. I recommend digging a little further into the lxml documentation; there may be more elegant and direct ways to achieve this. There's a lot to explore!

Solution:

from lxml import etree


class FindClasses(object):
    @staticmethod
    def parse_xml(xml_string):
        # Parse the raw bytes directly into the root element.
        return etree.fromstring(xml_string)

    def find(self, xml_string):
        for connection in self.parse_xml(xml_string).iter('connection'):
            for child in connection:
                if child.tag == 'id':
                    print(child.text)
                    self.find_classes(connection)

    @staticmethod
    def find_classes(connection):
        for child in connection:  # children of connection -> id, classes
            for cls in child:  # children of classes -> class (id has none)
                for classname in cls:  # children of class -> classname
                    print(classname.text)
        print('')  # blank line between connections

if __name__ == '__main__':
    # foo.xml, or the path to your XML file
    with open('foo.xml', 'rb') as xml_file:
        xml = xml_file.read()
    f = FindClasses()
    f.find(xml)

Output:

10
DVD
DVD_TEST

20
TV
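
Since the question prefers ElementTree and also asks to be told when a connection has no classes, the same per-connection traversal can be sketched with the standard library alone. This is a minimal sketch, not the answer author's code: it assumes Python 3, the helper name classes_by_connection is hypothetical, and the third connection (id 30) is made-up data added only to show the empty case.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, plus a hypothetical connection (id 30)
# with no classes, added only to demonstrate the empty case.
XML = """<test>
  <connections>
    <connection>
      <id>10</id>
      <classes>
        <class><classname>DVD</classname></class>
        <class><classname>DVD_TEST</classname></class>
      </classes>
    </connection>
    <connection>
      <id>20</id>
      <classes>
        <class><classname>TV</classname></class>
      </classes>
    </connection>
    <connection>
      <id>30</id>
      <classes/>
    </connection>
  </connections>
</test>"""


def classes_by_connection(xml_text):
    """Map each connection id to the list of its own classnames."""
    root = ET.fromstring(xml_text)
    result = {}
    for connection in root.iter('connection'):
        conn_id = connection.findtext('id')
        # The path is relative to this connection element, so classes
        # belonging to other connections are never matched.
        result[conn_id] = [
            name.text
            for name in connection.findall('classes/class/classname')
        ]
    return result


if __name__ == '__main__':
    for conn_id, names in classes_by_connection(XML).items():
        print(conn_id)
        if names:
            for name in names:
                print(name)
        else:
            print('(no classes)')
        print('')
```

An empty list marks a connection without classes, so the caller can report it instead of silently skipping it.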

Answer 2:

Your problem is with your xpath expression. It is evaluated against the whole tree, so it knows nothing about the connection your outer for loop is currently on. The result of:

tree.xpath('./connections/connection/classes/class')

is a list of every element that matches the pattern given to xpath. In this case, all of your <class> elements match, so all of them are selected on every pass of the outer loop. (This is actually the great power of xpath: it can select all of those nodes at once when your data is stored this way.)
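
The difference between searching from the root and searching from a single connection can be illustrated with a short sketch. It uses the standard library's ElementTree and its findall method rather than the lxml xpath call from the question, and assumes Python 3; the variable names are illustrative only. lxml elements also provide an xpath method that evaluates a relative expression from that element, so the same idea should carry over.

```python
import xml.etree.ElementTree as ET

# Compact copy of the sample document from the question.
XML = """<test><connections>
  <connection><id>10</id><classes>
    <class><classname>DVD</classname></class>
    <class><classname>DVD_TEST</classname></class>
  </classes></connection>
  <connection><id>20</id><classes>
    <class><classname>TV</classname></class>
  </classes></connection>
</connections></test>"""

root = ET.fromstring(XML)

# Searched from the root, the path matches every <class> in the document,
# no matter which connection a surrounding loop happens to be visiting.
all_classes = root.findall('./connections/connection/classes/class')

# Searched from one <connection>, the relative path matches only the
# <class> elements nested inside that connection.
per_connection = {}
for connection in root.findall('./connections/connection'):
    own_classes = connection.findall('./classes/class')
    per_connection[connection.findtext('id')] = [
        cls.findtext('classname') for cls in own_classes
    ]

if __name__ == '__main__':
    print(len(all_classes))
    print(per_connection)
```

The key change is the object the search starts from: the current connection element instead of the tree.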

Source: https://stackoverflow.com/questions/20807360/how-to-properly-parse-parent-child-xml-with-python

Tags

python-2.7

xml-parsing

parent-child

lxml

elementtree