Parsing huge, badly encoded XML files in Python

前端未结

关注

 4  1392

我寻月下人不归 2021-01-11 15:14

I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream

4条回答

无人及你 (楼主)

2021-01-11 15:43
If the problems are actual character encoding problems, rather than malformed XML, the easiest, and probably most efficient, solution is to deal with it at the file reading point. Like this:
```
import codecs
from lxml import etree
events = ("start", "end")
reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
context = etree.iterparse(reader, events=events)
```
This will cause the non-UTF8-readable bytes to be replaced by '?'. There are a few other options; see the documentation for the codecs module for more.
0 讨论(0)

查看其它4个回答
发布评论:

提交评论
- 加载中...