How do I capture all of the element names of an XML file using lxml in Python?

Question

I am able to use lxml to accomplish most of what I would like to do, although it was a struggle to go through the obfuscating examples and tutorials. In short, I am able to read an external xml file and import it via lxml into the proper tree-like format.

To demonstrate this, if I were to type:

print(etree.tostring(myXmlTree, pretty_print=True, method="xml", encoding="unicode"))

I get the following output:

<net xmlns="http://www.arin.net/whoisrws/core/v1" xmlns:ns2="http://www.arin.net/whoisrws/rdns/v1" xmlns:ns3="http://www.arin.net/whoisrws/netref/v2" termsOfUse="https://www.arin.net/whois_tou.html">
 <registrationDate>2006-08-29T00:00:00-04:00</registrationDate>
 <ref>http://whois.arin.net/rest/net/NET-79-0-0-0-1</ref>
 <endAddress>79.255.255.255</endAddress>
 <handle>NET-79-0-0-0-1</handle>
 <name>79-RIPE</name>
 <netBlocks>
  <netBlock>
   <cidrLength>8</cidrLength>
   <endAddress>79.255.255.255</endAddress>
   <description>Allocated to RIPE NCC</description>
   <type>RN</type>
   <startAddress>79.0.0.0</startAddress>
  </netBlock>
 </netBlocks>
 <orgRef name="RIPE Network Coordination Centre" handle="RIPE">http://whois.arin.net/rest/org/RIPE</orgRef>
 <comment>
  <line number="0">These addresses have been further assigned to users in</line>
  <line number="1">the RIPE NCC region. Contact information can be found in</line>
  <line number="2">the RIPE database at http://www.ripe.net/whois</line>
 </comment>
 <startAddress>79.0.0.0</startAddress>
 <updateDate>2009-05-18T07:34:02-04:00</updateDate>
 <version>4</version>
</net>

OK, that's great for human consumption, but not useful for machines. If I'd wanted particular elements, like say the start and end IP addresses in the xml, I could type:

ns = myXmlTree.nsmap[None]  # URI of the default (unprefixed) namespace
myXmlTree.findall("{" + ns + "}startAddress")[0].text
myXmlTree.findall("{" + ns + "}endAddress")[0].text

and I would receive:

'79.0.0.0'
'79.255.255.255'

But I still need to LOOK at the xml file as a human to know what elements are there. Instead, I would like to be able to retrieve the names of ALL of the elements at a particular level and then automatically traverse that level. So, for instance, I'd like to do something like:

myElements = myXmlTree.findallelements("{" + ns + "}")

and it would give me a return value something like:

['registrationDate', 'ref', 'endAddress', 'handle', 'name', 'netBlocks', 'orgRef', 'comment', 'startAddress', 'updateDate', 'version']

Especially awesome would be if it could tell me the entire structure of elements, including the nested ones.

I'm SURE there's a way, as it wouldn't make sense otherwise.

Thanks in advance!!

P.S., I know that I can iterate and go through the list of all iterations. I was hoping there was already a method within lxml that had these data. If iteration is the only way, I guess that's OK... it just seems clunky to me.

Answer 1:

I believe you are looking for element.xpath().

XPath is not a concept introduced by lxml but a general query language for selecting nodes from an XML document supported by many things that deal with XML. Think of it as something similar to CSS selectors, but more powerful (also a bit more complicated). See XPath Syntax.

Your document uses namespaces. I'll ignore that for now and explain at the end of the post how to deal with them, because it keeps the examples more readable (but it also means they won't work as-is for your document).

So, for example,

tree.xpath('/net/endAddress')

would select the <endAddress>79.255.255.255</endAddress> element directly below the <net /> node, but not the <endAddress /> inside the <netBlock>.

The XPath expression

tree.xpath('//endAddress')

however would select all <endAddress /> nodes anywhere in the document.

You can of course further query the nodes you get back with XPath expressions:

netblocks = tree.xpath('/net/netBlocks/netBlock')
for netblock in netblocks:
    start = netblock.xpath('./startAddress/text()')[0]
    end = netblock.xpath('./endAddress/text()')[0]
    print("%s - %s" % (start, end))

would give you

79.0.0.0 - 79.255.255.255

Notice that .xpath() always returns a list of selected nodes - so if you want just one, account for that.
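
Because the result is a list, indexing it with [0] raises an IndexError when nothing matched. Here is a minimal sketch of one way to guard against that. The helper name first and the trimmed, namespace-free sample document are assumptions made for illustration; neither is part of lxml.

```python
from lxml import etree

# Trimmed, namespace-free stand-in for the document in the question.
XML = b"""<net>
  <endAddress>79.255.255.255</endAddress>
  <startAddress>79.0.0.0</startAddress>
</net>"""

tree = etree.fromstring(XML)


def first(results, default=None):
    """Return the first item of an XPath result list, or `default` if it is empty."""
    return results[0] if results else default


start = first(tree.xpath('/net/startAddress/text()'))
missing = first(tree.xpath('/net/noSuchElement/text()'), default='n/a')
print(start)    # 79.0.0.0
print(missing)  # n/a
```

Whether a missing node should produce a default or an error depends on how strict you want to be about the document's shape.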

You can also select elements by their attributes:

comment = tree.xpath('/net/comment')[0]
line_2 = comment.xpath("./line[@number='2']")[0]

This would select the <line /> element with number="2" from the first comment.

You can also select attributes themselves:

numbers = tree.xpath('//line/attribute::number')

['0', '1', '2']

To get the list of element names you asked about last, you could do something like this:

names = [node.tag for node in tree.xpath('/net/*')]

['registrationDate', 'ref', 'endAddress', 'handle', 'name', 'netBlocks', 'orgRef', 'comment', 'startAddress', 'updateDate', 'version']

But given the power of XPath, it's probably better to just query the document for what you want to know from it, as specific or loose as you see fit.
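
If you do want an overview of the entire structure, including nested elements, a short recursive walk is probably the simplest route. The following is only a sketch: the function name element_paths and the shortened sample document are assumptions for illustration. It uses etree.QName(...).localname to drop any namespace from the tag, so it should behave the same on namespaced and plain documents.

```python
from lxml import etree

# Shortened stand-in for the document in the question.
XML = b"""<net xmlns="http://www.arin.net/whoisrws/core/v1">
  <handle>NET-79-0-0-0-1</handle>
  <netBlocks>
    <netBlock>
      <cidrLength>8</cidrLength>
      <type>RN</type>
    </netBlock>
  </netBlocks>
  <version>4</version>
</net>"""

root = etree.fromstring(XML)


def element_paths(element, prefix=''):
    """Yield a slash-separated path of local names for `element` and every descendant."""
    # Comments and processing instructions have a non-string .tag; skip them.
    if not isinstance(element.tag, str):
        return
    path = prefix + '/' + etree.QName(element).localname
    yield path
    # Iterating an element yields its direct children in document order.
    for child in element:
        yield from element_paths(child, path)


paths = list(element_paths(root))
for path in paths:
    print(path)
```

For this sample the walk prints /net, then /net/handle, /net/netBlocks, /net/netBlocks/netBlock, the two leaf paths under netBlock, and finally /net/version. Repeated siblings (such as the three <line> elements in your document) would show up as repeated paths; wrap the result in a set or a dict if you only want the distinct ones.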

Now, namespaces. As you noticed, if your document uses XML namespaces, you need to take that into consideration in many places, and XPath is no different. When querying a namespaced document, you pass the xpath() method the namespace map like this:

NSMAP = {'ns':  'http://www.arin.net/whoisrws/core/v1',
         'ns2': 'http://www.arin.net/whoisrws/rdns/v1',
         'ns3': 'http://www.arin.net/whoisrws/netref/v2'}

names = [node.tag for node in tree.xpath('/ns:net/*', namespaces=NSMAP)]

In many other places in lxml you can specify the default namespace by using None as the dictionary key in the namespace map. Unfortunately this does not work with xpath(); it raises an exception:

TypeError: empty namespace prefix is not supported in XPath

So you unfortunately have to prefix every node name in your XPath expression with ns: (or whatever you choose to map that namespace to).
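
One more thing to be aware of with a namespaced document: node.tag includes the namespace URI in curly braces, so the list of names will not consist of bare element names. A small sketch of how to get the bare names back, assuming a cut-down version of your document and keeping only the core namespace in the map:

```python
from lxml import etree

# Cut-down stand-in for the namespaced document in the question.
XML = b"""<net xmlns="http://www.arin.net/whoisrws/core/v1">
  <endAddress>79.255.255.255</endAddress>
  <startAddress>79.0.0.0</startAddress>
</net>"""

tree = etree.fromstring(XML)
NSMAP = {'ns': 'http://www.arin.net/whoisrws/core/v1'}

# '*' matches every element child of <net>, whatever its name.
children = tree.xpath('/ns:net/*', namespaces=NSMAP)

# .tag carries the namespace in Clark notation: '{uri}localname'.
qualified = [node.tag for node in children]

# etree.QName splits a tag into its namespace and its local name.
names = [etree.QName(node).localname for node in children]

print(qualified[0])  # {http://www.arin.net/whoisrws/core/v1}endAddress
print(names)         # ['endAddress', 'startAddress']
```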

For more information on the XPath syntax, see for example the XPath Syntax page in the W3Schools Xpath Tutorial.

To get going with XPath it can also be very helpful to fiddle around with your document in one of the many XPath testers. Also, the Firebug plugin for Firefox, or Google Chrome inspector allow you to show the (or rather, one of many) XPath for the selected element.

Source: https://stackoverflow.com/questions/19456562/how-i-do-capture-all-of-the-element-names-of-an-xml-file-using-lxml-in-python

Tags

python

xml

lxml