I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream
If the problems are actual character encoding problems, rather than malformed XML, the easiest, and probably most efficient, solution is to deal with it at the file reading point. Like this:
import codecs
from lxml import etree
events = ("start", "end")
reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
context = etree.iterparse(reader, events=events)
This will cause the non-UTF8-readable bytes to be replaced by '?'. There are a few other options; see the documentation for the codecs module for more.