Parsing huge, badly encoded XML files in Python

前端 未结 4 1392
我寻月下人不归
我寻月下人不归 2021-01-11 15:14

I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream

4条回答
  •  无人及你
    2021-01-11 15:43

    If the problems are actual character encoding problems, rather than malformed XML, the easiest, and probably most efficient, solution is to deal with it at the file reading point. Like this:

    import codecs
    from lxml import etree
    events = ("start", "end")
    reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
    context = etree.iterparse(reader, events=events)
    

    This will cause the non-UTF8-readable bytes to be replaced by '?'. There are a few other options; see the documentation for the codecs module for more.

提交回复
热议问题