I\'m trying to parse an XML file that\'s over 2GB with Python\'s lxml library. Unfortunately, the XML file does not have a line telling the character encoding, so I have to
Found this thread from Google and while @Michael's answer ultimately lead me to a solution (to my problem at least) I wanted to provide a bit more of a copy/paste answer here for issues that can be solved so simply:
from lxml import etree
# Create a parser
parser = etree.XMLParser(recover=True)
parsed_file = etree.parse('/path/to/your/janky/xml/file.xml', parser=parser)
I was facing an issue where I had no control over the XML pre-processing and was being given a file with invalid characters. @Michael's answer goes on to elaborate on a way to approach invalid characters from which recover=True can't address. Fortunately for me, this was enough to keep things moving along.