How should I deal with an XMLSyntaxError in Python's lxml while parsing a large XML file?

死守一世寂寞 2020-12-25 14:38

I'm trying to parse an XML file that's over 2GB with Python's lxml library. Unfortunately, the XML file has no declaration line stating its character encoding, so I have to

4 Answers
  •  醉酒成梦
    2020-12-25 15:06

    Found this thread from Google, and while @Michael's answer ultimately led me to a solution (to my problem at least), I wanted to provide a bit more of a copy/paste answer here for issues that can be solved this simply:

    from lxml import etree
    
    # Create a parser that tries to keep going past malformed input
    # instead of raising XMLSyntaxError
    parser = etree.XMLParser(recover=True)
    
    parsed_file = etree.parse('/path/to/your/janky/xml/file.xml', parser=parser)
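
    Since recover=True can silently drop or truncate content, it may be worth checking what the parser complained about. A minimal sketch, assuming a small in-memory sample instead of the real file (broken_xml and parse_and_report are hypothetical names, not part of lxml):

```python
from lxml import etree

# Hypothetical sample: the raw 0x01 byte is not a legal XML 1.0 character.
broken_xml = b"<root><item>ok</item><item>bad \x01 char</item></root>"

def parse_and_report(data):
    """Parse leniently; return the root and a list of 'line:column: message' strings."""
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(data, parser)
    # error_log holds the errors and warnings from the parser's last run.
    problems = [f"{e.line}:{e.column}: {e.message}" for e in parser.error_log]
    return root, problems

root, problems = parse_and_report(broken_xml)
for problem in problems:
    print(problem)
```

    How much of the document survives after the first error depends on the underlying libxml2 version, so the recovered tree is best treated as a best effort rather than a faithful copy.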
    

    I was facing an issue where I had no control over the XML pre-processing and was being given a file with invalid characters. @Michael's answer goes on to elaborate on a way to handle invalid characters that recover=True can't address. Fortunately for me, recover=True alone was enough to keep things moving along.
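
    For cases where recover=True is not enough, one common approach (not necessarily the one @Michael's answer describes) is to strip the offending bytes before the parser sees them. The sketch below assumes the file is in an ASCII-compatible encoding such as UTF-8, which is what XML defaults to when no encoding is declared; it would corrupt UTF-16 input. The function name, chunk size, and sample data are hypothetical. It reads in chunks so the raw file is never held in memory all at once, although the resulting tree still is.

```python
import io
import re
from lxml import etree

# C0 control bytes that XML 1.0 forbids (everything below 0x20 except tab, LF, CR).
ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def parse_stripping_control_bytes(stream, chunk_size=1 << 20):
    """Feed a binary stream to lxml chunk by chunk, dropping illegal control bytes."""
    parser = etree.XMLParser()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        cleaned = ILLEGAL_XML_BYTES.sub(b"", chunk)
        if cleaned:
            parser.feed(cleaned)
    # close() finishes parsing and returns the root element.
    return parser.close()

# Hypothetical stand-in for open('/path/to/file.xml', 'rb')
sample = io.BytesIO(b"<root><item>bad \x01 char</item></root>")
root = parse_stripping_control_bytes(sample, chunk_size=8)
```

    Because the pattern matches single bytes only, it does not matter where the chunk boundaries fall. This handles stray control characters only; invalid byte sequences for the encoding itself would need a different fix.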
