Question
I am using lxml to parse some XML files. I don't create them; I only parse them. Some of the files contain invalid URIs for the namespaces. For instance:
'D:\Path\To\some\local\file.xsl'
I get an error when I try to process it:
lxml.etree.XMLSyntaxError: xmlns:xsi: 'D:\Path\To\some\local\file.xsl' is not a valid URI
Is there an easy way to replace any invalid URIs with something (anything, such as 'http://www.googlefsdfsd.com/')? I thought of writing a regex but was hoping for an easier way.
Answer 1:
What the parser doesn't like are the backslashes in the namespace URI.
To parse the XML despite the invalid URIs, you can instantiate an lxml.etree.XMLParser with the recover argument set to True and then use that to parse the file:
from lxml import etree
recovering_parser = etree.XMLParser(recover=True)
xml = etree.parse("xmlfile.xml", parser=recovering_parser)
...
Answer 2:
If you are sure that those specific errors are not significant to your use case, you could just catch the error as an exception:
import lxml.etree

try:
    # process your tree here
    SomeFn()
except lxml.etree.XMLSyntaxError as e:
    print("Ignoring", e)
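If neither recovering nor skipping the file is acceptable, the regex replacement the question mentions could look roughly like the sketch below. It is only an illustration: the function name replace_backslash_namespaces and the placeholder URI are made up for this example, and it treats "contains a backslash" as the only sign of an invalid URI, which matches the error above but not every possible bad URI. It works on the raw text, so it would also touch a matching xmlns-like string inside a comment or CDATA section.

```python
import re

# Hypothetical placeholder; any syntactically valid URI would do.
PLACEHOLDER_URI = "http://example.invalid/placeholder-ns"

# Matches xmlns="..." or xmlns:prefix="..." with single or double quotes.
_XMLNS_DECL = re.compile(r"""(\bxmlns(?::[\w.-]+)?\s*=\s*)(["'])(.*?)\2""")


def replace_backslash_namespaces(xml_text, placeholder=PLACEHOLDER_URI):
    """Swap namespace URIs that contain a backslash for a placeholder."""
    def fix(match):
        prefix, quote, uri = match.groups()
        if "\\" in uri:
            # Assumption: a backslash is what makes the URI invalid.
            return prefix + quote + placeholder + quote
        # Leave every other declaration exactly as it was.
        return match.group(0)

    return _XMLNS_DECL.sub(fix, xml_text)


sample = (
    r'<root xmlns:xsi="D:\Path\To\some\local\file.xsl" '
    r'xmlns:a="http://example.com/a"/>'
)
cleaned = replace_backslash_namespaces(sample)
print(cleaned)
```

The cleaned text can then be handed to the normal, non-recovering parser. Keep in mind that any element or attribute using the affected prefix ends up in the placeholder namespace, so this only makes sense when that namespace does not matter to the rest of the processing.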
Source: https://stackoverflow.com/questions/18692965/how-do-i-skip-validating-the-uri-in-lxml