Is there a way to recover iterparse on invalid Char values?

Question

I'm using lxml's iterparse to parse some big XML files (3-5 GB). Some of these files contain invalid characters, so an lxml.etree.XMLSyntaxError is raised.

When using lxml.etree.parse I can provide a parser which recovers on invalid characters:

import lxml.etree

parser = lxml.etree.XMLParser(recover=True)
root = lxml.etree.parse(open("myMalformed.xml"), parser)

Is there a way to get the same functionality for iterparse?

Edit: Encoding is not an issue here. These XML files contain invalid characters, which can be sanitized by defining an XMLParser with recover=True. Since I need to use iterparse for this, I can't pass a custom parser. So I'm looking for the functionality provided in my snippet above, but for this call:

context = etree.iterparse(open("myMalformed.xml"), events=('end',), tag="Foo")  # <-- can't recover

Answer 1:

When you say invalid characters, do you mean Unicode characters? If so, you can try:

lxml.etree.XMLParser(encoding='UTF-8', recover=True)

If you mean malformed XML then this obviously won't work. If you can post your traceback, we can see the nature of the XMLSyntaxError which will provide more information.
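
If the offending characters are the control bytes that XML 1.0 forbids (anything below 0x20 other than tab, LF and CR), one possible workaround is to filter them out of the byte stream before iterparse sees them. The sketch below is only that: SanitizedXMLFile is a hypothetical helper name, it assumes UTF-8 or another ASCII-compatible encoding, and it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in so it runs without lxml. It is not equivalent to recover=True, which tolerates more kinds of damage. Recent lxml releases also document a recover keyword on iterparse itself, so checking the signature of your installed version may be worth a first try.

```python
import io
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree in this sketch

# Control bytes forbidden by XML 1.0: everything below 0x20 except tab, LF, CR.
# In UTF-8 these byte values never occur inside a multi-byte sequence,
# so removing them at the byte level is safe under that assumption.
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class SanitizedXMLFile:
    """Hypothetical read-only wrapper that drops forbidden control bytes."""

    def __init__(self, raw):
        self._raw = raw  # any binary file-like object with a read() method

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return b""  # real end of file
            cleaned = _INVALID_XML_BYTES.sub(b"", chunk)
            if cleaned:
                return cleaned
            # The chunk held only invalid bytes; read on so that an empty
            # result is not mistaken for end of file.


def foo_texts(source):
    """Stream over the document and collect the text of every <Foo> element."""
    texts = []
    for _event, elem in ET.iterparse(SanitizedXMLFile(source), events=("end",)):
        if elem.tag == "Foo":
            texts.append(elem.text)
            elem.clear()  # release the element's content to keep memory flat
    return texts


broken = io.BytesIO(b"<root><Foo>a\x00b</Foo><Foo>c\x0bd</Foo></root>")
print(foo_texts(broken))
```

With lxml, the same wrapper could be passed in place of the plain file object, for example etree.iterparse(SanitizedXMLFile(open("myMalformed.xml", "rb")), events=('end',), tag="Foo"), since iterparse reads from any object with a read() method. The filtering happens chunk by chunk, so the multi-gigabyte file is never loaded into memory at once.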

Source: https://stackoverflow.com/questions/14934854/is-there-a-way-to-recover-iterparse-on-invalid-char-values

Tags

python

lxml
