How do I skip validating the URI in lxml?

Submitted by 人走茶凉 on 2019-12-10 12:13:44

Question


I am using lxml to parse some XML files. I don't create them; I'm just parsing them. Some of the files contain invalid URIs for the namespaces. For instance:

'D:\Path\To\some\local\file.xsl'

I get an error when I try to process it:

lxml.etree.XMLSyntaxError: xmlns:xsi: 'D:\Path\To\some\local\file.xsl' is not a valid URI

Is there an easy way to replace any invalid URIs with something (anything, such as 'http://www.googlefsdfsd.com/')? I thought of writing a regex but was hoping for an easier way.


Answer 1:


What the parser doesn't like are the backslashes in the namespace URI.

To parse the XML despite the invalid URIs, you can instantiate an lxml.etree.XMLParser with the recover argument set to True and then use that parser to parse the file:

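from lxml import etree
# recover=True makes the parser keep going past errors such as the invalid URI
recovering_parser = etree.XMLParser(recover=True)
# Pass the parser explicitly; the default parser would raise XMLSyntaxError
xml = etree.parse("xmlfile.xml", parser=recovering_parser)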
...
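
To try this without a file on disk, here is a minimal, self-contained sketch of the same approach. The SAMPLE document and the parse_leniently helper are made up for illustration and only mimic the namespace URI from the question. The sketch also reads the parser's error_log afterwards, so you can see what the recovering parser skipped over instead of ignoring it silently.

```python
from lxml import etree

# Hypothetical sample document; the namespace URI mimics the one in the question.
SAMPLE = rb"""<root xmlns:xsi='D:\Path\To\some\local\file.xsl'>
  <item>text</item>
</root>"""


def parse_leniently(data):
    """Parse XML bytes with a recovering parser; return (root, logged messages)."""
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(data, parser=parser)
    # error_log holds the problems reported during the last parse
    messages = [entry.message for entry in parser.error_log]
    return root, messages


root, messages = parse_leniently(SAMPLE)
print(root.tag)
for message in messages:
    print("recovered from:", message)
```

Keep in mind that recover=True applies to every error in the document, not only to invalid namespace URIs, so checking the log is a cheap way to notice when something more serious was papered over.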



Answer 2:


If you are sure that those specific errors are not significant to your use case, you could just catch the exception:

from lxml import etree

try:
    # process your tree here
    SomeFn()
except etree.XMLSyntaxError as e:
    # Report the error and carry on instead of aborting
    print("Ignoring", e)
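
Note that catching the exception does not repair anything: a document that fails to parse is simply skipped. A rough sketch of that pattern over several inputs might look like the following, where try_parse and the two sample documents are hypothetical and only there to make the example runnable.

```python
from lxml import etree


def try_parse(data):
    """Return the root element, or None if the document cannot be parsed."""
    try:
        return etree.fromstring(data)
    except etree.XMLSyntaxError as e:
        print("Ignoring", e)
        return None


# Hypothetical inputs: one well-formed document and one with mismatched tags.
documents = [b"<a><b/></a>", b"<a><b></a>"]
roots = [try_parse(doc) for doc in documents]
# Compare against None explicitly; an element's truth value is not a reliable check
parsed = [r for r in roots if r is not None]
print(len(parsed), "of", len(documents), "documents parsed")
```

If you need the content of the failing files as well, the recovering parser from Answer 1 is probably the better fit.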


Source: https://stackoverflow.com/questions/18692965/how-do-i-skip-validating-the-uri-in-lxml
