Question
For example, consider parsing a pom.xml file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<parent>
<groupId>com.parent</groupId>
<artifactId>parent</artifactId>
<version>1.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>
<modelVersion>2.0.0</modelVersion>
<groupId>com.parent.somemodule</groupId>
<artifactId>some_module</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>Some Module</name>
...
Code:
import xml.etree.ElementTree as ET
tree = ET.parse(pom)
root = tree.getroot()
groupId = root.find("groupId")
artifactId = root.find("artifactId")
Both groupId and artifactId are None. Why, when they are direct children of the root? I tried replacing root with tree (groupId = tree.find("groupId")), but that didn't change anything.
Answer 1:
The problem is that you don't have a child named groupId; you have a child named {http://maven.apache.org/POM/4.0.0}groupId. etree doesn't ignore XML namespaces; it uses "universal names". See Working with Namespaces and Qualified Names in the effbot docs.
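To make the universal-name point concrete, here is a minimal sketch. It assumes a trimmed-down copy of the pom.xml from the question held in a string (the POM variable and the "m" prefix are illustrative names, not part of the original code). It shows the unqualified lookup failing and three ways of qualifying the tag; the {*} wildcard form requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the pom.xml in the question (assumption).
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <parent>
    <groupId>com.parent</groupId>
    <artifactId>parent</artifactId>
  </parent>
  <groupId>com.parent.somemodule</groupId>
  <artifactId>some_module</artifactId>
</project>"""

root = ET.fromstring(POM)

# The default namespace becomes part of every tag name.
root_tag = root.tag  # "{http://maven.apache.org/POM/4.0.0}project"

# Unqualified lookup finds nothing, as in the question.
unqualified = root.find("groupId")

# Option 1: spell out the universal name in braces.
braced = root.find("{http://maven.apache.org/POM/4.0.0}groupId")

# Option 2: map a prefix of your choice to the namespace URI.
ns = {"m": "http://maven.apache.org/POM/4.0.0"}
prefixed = root.find("m:groupId", ns)
artifact_id = root.find("m:artifactId", ns)

# Option 3 (Python 3.8+): match the local name in any namespace.
wildcard = root.find("{*}groupId")

print(unqualified, braced.text, artifact_id.text)
```

Because find() with a bare tag path only looks at direct children, the groupId nested inside parent is not returned by any of these lookups.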
Answer 2:
Just to expand on abarnert's comment about BeautifulSoup: if you just want a quick and dirty solution to the problem, this is probably the fastest way to go about it. I implemented this for a personal script that uses bs4, where you can traverse the tree with
element = dom.getElementsByTagNameNS('*','elementname')
This matches elements with that local name in ANY namespace, which is handy if you know you've only got one namespace in the file, so there's no ambiguity.
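The getElementsByTagNameNS call is part of the DOM API, which the standard library exposes through xml.dom.minidom, so the sketch below assumes minidom as the parser rather than bs4. The POM string is again a trimmed-down stand-in for the file in the question. Note that the method searches all descendants, so it also picks up the groupId inside parent; the last step filters down to direct children of the root.

```python
from xml.dom import minidom

# Trimmed-down stand-in for the pom.xml in the question (assumption).
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <parent>
    <groupId>com.parent</groupId>
    <artifactId>parent</artifactId>
  </parent>
  <groupId>com.parent.somemodule</groupId>
  <artifactId>some_module</artifactId>
</project>"""

dom = minidom.parseString(POM)

# "*" as the namespace URI matches elements in any namespace.
matches = dom.getElementsByTagNameNS("*", "groupId")

# Every descendant named groupId is returned, in document order.
values = [node.firstChild.data for node in matches]

# Keep only the groupId that is a direct child of <project>.
direct = [
    node.firstChild.data
    for node in matches
    if node.parentNode is dom.documentElement
]

print(values, direct)
```

If the file could mix several namespaces, passing the actual namespace URI instead of "*" avoids accidental matches.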
Source: https://stackoverflow.com/questions/21146417/simple-dom-traversing-in-python-using-xml-etree-elementtree