How should I deal with an XMLSyntaxError in Python's lxml while parsing a large XML file?

死守一世寂寞 2020-12-25 14:38

I'm trying to parse an XML file that's over 2GB with Python's lxml library. Unfortunately, the XML file does not have a line declaring the character encoding, so I have to

4 Answers

醉酒成梦 (OP)

2020-12-25 15:06
Found this thread from Google, and while @Michael's answer ultimately led me to a solution (to my problem at least), I wanted to provide a more copy/paste-ready answer here for issues that can be solved this simply:
```
from lxml import etree

# Create a parser
parser = etree.XMLParser(recover=True)

parsed_file = etree.parse('/path/to/your/janky/xml/file.xml', parser=parser)
```
I was facing an issue where I had no control over the XML pre-processing and was being given a file with invalid characters. @Michael's answer goes on to elaborate on a way to handle invalid characters that recover=True can't address. Fortunately for me, recover=True was enough to keep things moving along.
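If you want to see what the parser had to skip over when recover=True is on, the parser object keeps an error log from the last parse. Below is a minimal sketch, not something from @Michael's answer: the sample bytes and the helper name `parse_leniently` are made up for illustration, and the broken input (a bare `&`) is just one kind of malformed XML that recovery mode tolerates.

```python
from lxml import etree

# Hypothetical sample: the bare "&" makes this document not well-formed.
BROKEN_XML = b"<root><item>ok</item><item>bad & worse</item></root>"


def parse_leniently(data):
    """Parse bytes with recover=True; return (root element, error messages)."""
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(data, parser=parser)
    # error_log holds what the parser complained about during this parse
    errors = [
        f"line {entry.line}, col {entry.column}: {entry.message}"
        for entry in parser.error_log
    ]
    return root, errors


root, errors = parse_leniently(BROKEN_XML)
print(root.tag)
for msg in errors:
    print(msg)
```

Checking the log is worth doing at least once, since recover=True silently drops whatever it can't make sense of, and you may care about what went missing.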