Question
I'm using lxml's iterparse to parse some big XML files (3-5 GB). Because some of these files contain invalid characters, an lxml.etree.XMLSyntaxError is thrown.
When using lxml.etree.parse I can provide a parser which recovers on invalid characters:
import lxml.etree
parser = lxml.etree.XMLParser(recover=True)
root = lxml.etree.parse(open("myMalformed.xml"), parser)
Is there a way to get the same functionality for iterparse?
Edit: Encoding is not an issue here. There are invalid characters in these XML files, which can be sanitized by defining an XMLParser with recover=True. Since I need to use iterparse for this, I can't use a custom parser. So I'm looking for the functionality provided in my snippet above for this call:
context = etree.iterparse(open("myMalformed.xml"), events=('end',), tag="Foo")  # <-- can't recover
Answer 1:
When you say invalid characters, do you mean Unicode characters? If so, you can try
lxml.etree.XMLParser(encoding='UTF-8', recover=True)
If you mean malformed XML, then this obviously won't work. If you can post your traceback, we can see the nature of the XMLSyntaxError, which will provide more information.
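For what it's worth, recent lxml releases appear to accept a recover keyword directly on etree.iterparse, which would give the streaming equivalent of XMLParser(recover=True). The following is only a minimal sketch under that assumption (check the iterparse signature of your installed version). The helper name iter_tags and the in-memory sample document are made up for illustration, and recovery may silently drop or alter the offending content.

```python
import io

from lxml import etree


def iter_tags(source, tag):
    """Yield the text of each `tag` element, letting the parser recover from errors.

    `iter_tags` is a hypothetical helper, not part of lxml.
    """
    # Assumption: this lxml version supports recover=True on iterparse.
    for _event, elem in etree.iterparse(
        source, events=("end",), tag=tag, recover=True
    ):
        yield elem.text
        # Drop the element's children and text to limit memory use on large files.
        elem.clear()


# Hypothetical sample: the first Foo contains an invalid character reference.
sample = io.BytesIO(b"<root><Foo>a&#1;b</Foo><Foo>ok</Foo></root>")
texts = list(iter_tags(sample, "Foo"))
print(len(texts), "Foo elements parsed")
```

For a real file you would pass the filename or a file opened in binary mode, e.g. iter_tags("myMalformed.xml", "Foo"). If recover is not available in your version, this call should fail with a TypeError for the unexpected keyword rather than misbehave quietly.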
Source: https://stackoverflow.com/questions/14934854/is-there-a-way-to-recover-iterparse-on-invalid-char-values