How should I deal with an XMLSyntaxError in Python's lxml while parsing a large XML file?

后端 未结 4 1227
死守一世寂寞
死守一世寂寞 2020-12-25 14:38

I\'m trying to parse an XML file that\'s over 2GB with Python\'s lxml library. Unfortunately, the XML file does not have a line telling the character encoding, so I have to

4条回答
  •  情书的邮戳
    2020-12-25 15:11

    I ran into this too, getting \x16 in data (the unicode 'synchronous idle' or 'SYN' character, displayed in the xml as ^V) which leads to an error when parsing the xml: XMLSyntaxError: PCDATA invalid Char value 22. The 22 is because because ord('\x16') is 22.

    The answer from @michael put me on the right track. But some control characters below 32 are fine, like the return or the tab, and a few higher characters are still bad. So:

    # Get list of bad characters that would lead to XMLSyntaxError.
    # Calculated manually like this:
    from lxml import etree
    from StringIO import StringIO
    BAD = []
    for i in range(0, 10000):
        try:
            x = etree.parse(StringIO('

    %s

    ' % unichr(i))) except etree.XMLSyntaxError: BAD.append(i)

    This leads to a list of 31 characters that can be hardcoded instead of doing the above calculation in code:

    BAD = [
        0, 1, 2, 3, 4, 5, 6, 7, 8,
        11, 12,
        14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
        # Two are perfectly valid characters but go wrong for different reasons.
        # 38 is '&' which gives: xmlParseEntityRef: no name.
        # 60 is '<' which gives: StartTag: invalid element namea different error.
    ]
    BAD_BASESTRING_CHARS = [chr(b) for b in BAD]
    BAD_UNICODE_CHARS = [unichr(b) for b in BAD]
    

    Then use it like this:

    def remove_bad_chars(value):
        # Remove bad control characters.
        if isinstance(value, unicode):
            for char in BAD_UNICODE_CHARS:
                value = value.replace(char, u'')
        elif isinstance(value, basestring):
            for char in BAD_BASESTRING_CHARS:
                value = value.replace(char, '')
        return value
    

    If value is 2 Gigabyte you might need to do this in a more efficient way, but I am ignoring that here, although the question mentions it. In my case, I am the one creating the xml file, but I need to deal with these characters in the original data, so I will use this function before putting data in the xml.

提交回复
热议问题