Question
I'm currently working with XML documents (adding elements, adding attributes, etc.), so I first need to parse the XML before working on it. However, lxml seems to be removing the <?xml ...> element. For example:
from lxml import etree
# Pass bytes: on Python 3, lxml rejects a str that carries an encoding declaration.
tree = etree.fromstring(b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etree.XMLParser())
print(etree.tostring(tree).decode())
will result in
<dmodule>test</dmodule>
Does anyone know why the <?xml ...> element is being removed? I thought encoding tags were valid XML. Thanks for your time.
Answer 1:
The <?xml ...?> line is an XML declaration, so it's not strictly an element. It just gives information about the XML tree below it, and it is not part of the tree that the parser returns.
If you need to print it out with lxml, pass the xml_declaration=True flag when serializing. There is some info about it here:
http://lxml.de/api.html#serialisation
etree.tostring(tree, xml_declaration=True)
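To make the effect of the flag concrete, here is a minimal sketch that serializes the same tree with and without it. It assumes Python 3. The fallback import of the standard library's ElementTree is my own addition so the snippet still runs without lxml installed (its tostring accepts these arguments from Python 3.8 on), and the variable names are only illustrative.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard-library ElementTree (Python 3.8+) accepts
    # the same arguments used below, so it serves as a stand-in.
    import xml.etree.ElementTree as etree

source = b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>'
root = etree.fromstring(source)

# Default serialization: only the element tree, no declaration.
without_decl = etree.tostring(root)

# Ask the serializer to write a declaration, with an explicit encoding.
with_decl = etree.tostring(root, xml_declaration=True, encoding="utf-8")

print(without_decl.decode("utf-8"))
print(with_decl.decode("utf-8"))
```

Note that the declaration is generated by the serializer from the arguments you pass, not copied from the input, so details such as quoting may differ from the original document.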
Answer 2:
"Does anyone know why the <?xml ...> element is being removed?"
XML defaults to version 1.0 and UTF-8 encoding, so the document is equivalent if you remove the declaration.
You are parsing some XML into a data structure and then converting that data structure back to XML. You will get a representation of that data structure in XML, but it might not be expressed in the same way as the input (so the prolog can be removed, <foo /> can be exchanged with <foo></foo>, and so on).
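A small sketch of that round-trip point, assuming Python 3: two inputs that differ only in how the empty element is written parse to the same structure, so they serialize identically. As before, the standard-library fallback import and the variable names are my own, added only so the snippet is self-contained.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard-library ElementTree behaves the same way
    # for this comparison, so it serves as a stand-in.
    import xml.etree.ElementTree as etree

# Same data, written two different ways.
compact = etree.fromstring(b"<root><foo /></root>")
expanded = etree.fromstring(b"<root><foo></foo></root>")

# The serializer picks one form for an empty element, whatever the input used.
out_compact = etree.tostring(compact)
out_expanded = etree.tostring(expanded)

print(out_compact == out_expanded)
```

Which spelling the serializer chooses for the empty element depends on the library, but either way the original spelling is not remembered.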
Source: https://stackoverflow.com/questions/3232252/lxml-removing-xml-tags-when-parsing