lxml removing <?xml …> tags when parsing?

Submitted by 喜你入骨 on 2019-12-11 04:04:08

Question


I'm currently modifying XML documents (adding elements, adding attributes, etc.), so I first need to parse the XML before working on it. However, lxml seems to be removing the <?xml ...> element. For example:

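from lxml import etree

# Bytes input: on Python 3, lxml rejects a str that carries an encoding declaration.
tree = etree.fromstring(b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>', etree.XMLParser())
# tostring() returns bytes; decode so the markup prints as plain text.
print(etree.tostring(tree).decode("utf-8"))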

will result in

<dmodule>test</dmodule>

Does anyone know why the <?xml ...> element is being removed? I thought encoding tags were valid XML. Thanks for your time.


Answer 1:


The <?xml ...?> line is the XML declaration, not an element, so it is not part of the element tree that the parser hands back. It only gives information (version and encoding) about the document that follows it.

If you need it in the output, lxml can write it for you: pass the xml_declaration=True keyword argument when serialising. The serialisation section of the API docs describes it:

http://lxml.de/api.html#serialisation

etree.tostring(tree, xml_declaration=True)
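
As a minimal sketch of the call above (the variable names are illustrative, and the input is passed as bytes on the assumption that the code runs on Python 3), the following shows the declaration being written with and without an explicit encoding argument. When no encoding is given, lxml serialises as ASCII and the declaration says so; passing encoding="utf-8" makes the declaration name UTF-8 instead.

```python
from lxml import etree

# Bytes input, so the encoding declaration is accepted on Python 3 as well.
source = b'<?xml version="1.0" encoding="utf-8"?><dmodule>test</dmodule>'
root = etree.fromstring(source)

# No encoding given: lxml serialises as ASCII and declares that encoding.
default_decl = etree.tostring(root, xml_declaration=True)

# Explicit encoding: the declaration names the encoding that was requested.
utf8_decl = etree.tostring(root, xml_declaration=True, encoding="utf-8")

print(default_decl.decode("ascii"))
print(utf8_decl.decode("utf-8"))
```

Both results are bytes objects, which is why they are decoded before printing.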



Answer 2:


Does anyone know why the <?xml ...> element is being removed?

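A document with no XML declaration is treated as XML 1.0, and without a declaration or byte order mark its encoding is taken to be UTF-8, so dropping a declaration that states exactly those values leaves an equivalent document.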

You are parsing some XML to a data structure and then converting that data structure back to XML. You will get a representation of that data structure in XML, but it might not be expressed in the same way (so the prolog can be removed and <foo /> can be exchanged with <foo></foo> and so on).
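
The round trip described above can be sketched with lxml as follows. The sample documents and variable names are made up for illustration. Two differently written inputs parse to the same tree and therefore serialise identically, while the values from the prolog remain available on the document's docinfo object rather than on any element.

```python
from lxml import etree

# Two spellings of the same document: with and without a prolog,
# and with the empty element written in long or short form.
with_prolog = b'<?xml version="1.0" encoding="utf-8"?><root><foo></foo></root>'
without_prolog = b'<root><foo /></root>'

first = etree.tostring(etree.fromstring(with_prolog))
second = etree.tostring(etree.fromstring(without_prolog))

# Both inputs parse to the same tree, so both serialise to the same bytes.
same_output = first == second
print(first, second, same_output)

# The declared values are kept on the document, not on any element.
docinfo = etree.fromstring(with_prolog).getroottree().docinfo
print(docinfo.xml_version, docinfo.encoding)
```

If the original prolog matters, reading it from docinfo and passing the same encoding back to tostring() together with xml_declaration=True is one way to reproduce it.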



Source: https://stackoverflow.com/questions/3232252/lxml-removing-xml-tags-when-parsing
