Python lxml: Ignore XML declaration (errors)

Submitted by 一笑奈何 on 2019-12-11 08:42:05

Question


I am trying to parse the custom actions file of the Thunar file manager (~/.config/Thunar/uca.xml) with the lxml Python module.

For some reason, Thunar writes a malformed XML declaration into this file:

<?xml encoding="UTF-8" version="1.0"?>

The version is expected to appear as the first "attribute" in the declaration, so lxml raises an XMLSyntaxError when I try to parse the file.

And no, I cannot simply correct the declaration, because Thunar keeps overwriting it with the bogus one.

This is very likely a bug in Thunar.

Nevertheless, I would like to know how to ignore the XML declaration with lxml.

I know that I could pre-process the XML document to strip out the XML declaration, but that doesn't seem very elegant. Since XML defaults to version 1.0 and UTF-8 encoding when no declaration is present, there should be a way to make lxml ignore the declaration and assume those defaults. I didn't find anything in the documentation or on Google, but I might have overlooked something.
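For reference, the pre-processing approach mentioned above could look roughly like the following sketch. It uses only the standard library; the helper name strip_xml_declaration and the sample bytes are made up for illustration, and it assumes the file really is UTF-8 encoded.

```python
import re

# Matches an XML declaration at the very start of the data, optionally
# preceded by a UTF-8 byte order mark. The \s after "xml" keeps other
# processing instructions such as <?xml-stylesheet ...?> untouched.
_XML_DECL = re.compile(rb'\A(?:\xef\xbb\xbf)?<\?xml\s.*?\?>', re.DOTALL)


def strip_xml_declaration(data: bytes) -> bytes:
    """Return data without a leading XML declaration (hypothetical helper)."""
    return _XML_DECL.sub(b'', data, count=1)


# Illustrative input only: the declaration from above plus a dummy root element.
sample = b'<?xml encoding="UTF-8" version="1.0"?>\n<actions/>'
cleaned = strip_xml_declaration(sample)
```

The cleaned bytes could then be passed to lxml's etree.fromstring, which falls back to XML 1.0 and UTF-8 when no declaration is present.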


Answer 1:


I know very little about Thunar, but if it produces the XML declaration shown in the question, then that is a bug. An incorrect XML declaration means the document is not well-formed.

The XML grammar specifies one correct order for the items in the XML declaration. version must come first and encoding second. See http://w3.org/TR/xml/#NT-XMLDecl.
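To make the ordering rule concrete, here is a rough sketch that transcribes the XMLDecl production (version, then optional encoding, then optional standalone) into a regular expression. It is an illustration of the grammar, not something lxml uses internally, and the function name is_valid_xml_declaration is invented for this example.

```python
import re

# Building blocks taken from the XML 1.0 grammar.
S = r'[ \t\r\n]+'                    # S: one or more whitespace characters
EQ = r'[ \t\r\n]*=[ \t\r\n]*'        # Eq: '=' with optional whitespace around it
ENC_NAME = r'[A-Za-z][A-Za-z0-9._-]*'


def _quoted(pattern: str) -> str:
    """Regex for pattern wrapped in matching double or single quotes."""
    return '(?:"' + pattern + '"|\'' + pattern + '\')'


VERSION = S + 'version' + EQ + _quoted(r'1\.[0-9]+')
ENCODING = S + 'encoding' + EQ + _quoted(ENC_NAME)
STANDALONE = S + 'standalone' + EQ + _quoted('(?:yes|no)')

# XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
XML_DECL = re.compile(
    r'<\?xml' + VERSION + '(?:' + ENCODING + ')?'
    + '(?:' + STANDALONE + ')?' + '(?:' + S + ')?' + r'\?>'
)


def is_valid_xml_declaration(decl: str) -> bool:
    """Check a declaration string against the XMLDecl production."""
    return XML_DECL.fullmatch(decl) is not None


good = is_valid_xml_declaration('<?xml version="1.0" encoding="UTF-8"?>')
bad = is_valid_xml_declaration('<?xml encoding="UTF-8" version="1.0"?>')
```

With this pattern the conventional declaration is accepted and the one from the question is rejected, because version is mandatory and must appear directly after the opening of the declaration.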

However, with lxml you can parse using a parser instance that has the recover option set to True. It works in this case. The bad XML declaration is ignored.

from lxml import etree

# recover=True tells the parser to continue past errors instead of raising
parser = etree.XMLParser(recover=True)
tree = etree.parse('uca.xml', parser)

See http://lxml.de/api/lxml.etree.XMLParser-class.html
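The difference between the default parser and a recovering one can also be checked without touching the real file. The snippet below is a small sketch along those lines; the document content is a made-up sample that only borrows the declaration from the question, not actual Thunar output.

```python
from lxml import etree

# Made-up sample data using the declaration from the question.
data = (
    b'<?xml encoding="UTF-8" version="1.0"?>\n'
    b'<actions><action><name>Example</name></action></actions>'
)

# The default (strict) parser is expected to reject the declaration.
try:
    etree.fromstring(data)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# A recovering parser skips over the bad declaration and keeps going.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(data, parser)
names = [el.text for el in root.iter('name')]

# Problems the parser recovered from remain available for inspection.
for entry in parser.error_log:
    print(entry.line, entry.message)
```

Keep in mind that recover=True applies to the whole document, so it will also paper over other well-formedness errors. Checking parser.error_log after parsing is one way to notice if more than the declaration was wrong.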



Source: https://stackoverflow.com/questions/44352989/python-lxml-ignore-xml-declaration-errors
