How should I deal with an XMLSyntaxError in Python's lxml while parsing a large XML file?

死守一世寂寞 2020-12-25 14:38

I'm trying to parse an XML file that's over 2GB with Python's lxml library. Unfortunately, the XML file does not have a line telling the character encoding, so I have to

4 Answers
  •  没有蜡笔的小新
    2020-12-25 15:28

    Python's codecs module supplies an EncodedFile class that works as a wrapper around a file object. Pass an instance of this class to lxml, configured to replace unknown characters with XML character references.

    Try doing this:

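    from lxml import etree
    import codecs

    # Python 3 has no file() builtin; open the file in binary mode instead.
    # Arguments: file object, data_encoding, file_encoding, errors.
    enc_file = codecs.EncodedFile(open("my_file.xml", "rb"), "ASCII", "ASCII", "xmlcharrefreplace")

    # Stream "start" events instead of building the whole tree up front.
    etparse = etree.iterparse(enc_file, events=("start",), encoding="CP1252")
    ...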
    

    The "xmlcharrefreplace" string is passed as the "errors" parameter and specifies what to do with characters that cannot be converted. It can be "strict" (raise an error), "ignore" (drop the character), "replace" (substitute a replacement marker such as "?"), "xmlcharrefreplace" (emit an "&#xxxx;" XML character reference) or "backslashreplace" (emit a Python backslash escape sequence). Note that "xmlcharrefreplace" applies only when encoding, not when decoding. For more information, check: http://docs.python.org/library/codecs.html
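To make the error handlers concrete, here is a minimal sketch that uses only the standard library. It first shows what each handler produces when encoding to ASCII, then recodes a byte stream chunk by chunk so that non-ASCII characters become XML character references. The helper name `recode_to_ascii`, the chunk size, and the assumption that the input is CP1252 are illustrative choices, not part of the answer above.

```python
import codecs
import io

sample = "café €5"

# Each handler treats characters that ASCII cannot represent differently.
print(sample.encode("ascii", "ignore"))             # b'caf 5'
print(sample.encode("ascii", "replace"))            # b'caf? ?5'
print(sample.encode("ascii", "xmlcharrefreplace"))  # b'caf&#233; &#8364;5'
print(sample.encode("ascii", "backslashreplace"))   # b'caf\\xe9 \\u20ac5'


def recode_to_ascii(raw, source_encoding="cp1252", chunk_size=64 * 1024):
    """Yield ASCII-only byte chunks read from a binary file-like object.

    Hypothetical helper: the source encoding is an assumption and must
    match the real file. Undecodable bytes become U+FFFD ("replace"),
    and every non-ASCII character is written as an &#NNNN; reference.
    """
    decoder = codecs.getincrementaldecoder(source_encoding)(errors="replace")
    while True:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        yield decoder.decode(chunk).encode("ascii", "xmlcharrefreplace")
    # Flush anything the decoder is still holding (relevant for multi-byte encodings).
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode("ascii", "xmlcharrefreplace")


# Stand-in for a large file: a small in-memory CP1252 byte stream.
recoded = b"".join(recode_to_ascii(io.BytesIO(sample.encode("cp1252"))))
print(recoded)  # b'caf&#233; &#8364;5'
```

Because the generator never holds more than one chunk in memory, it should scale to multi-gigabyte input. The chunks could then be handed to an incremental parser, for example through the `feed()` method of lxml's `XMLPullParser`, though that pairing is a suggestion and has not been checked against the original file.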
