How should I deal with an XMLSyntaxError in Python's lxml while parsing a large XML file?

死守一世寂寞 2020-12-25 14:38

I'm trying to parse an XML file that's over 2GB with Python's lxml library. Unfortunately, the XML file does not have a line telling the character encoding, so I have to

4 answers

情书的邮戳 (original poster)

2020-12-25 15:11

I ran into this too: my data contained \x16 (the Unicode 'synchronous idle' or 'SYN' character, displayed in the XML as ^V), which leads to an error when parsing the XML: XMLSyntaxError: PCDATA invalid Char value 22. The 22 is because ord('\x16') is 22.
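The arithmetic behind the 22 and the ^V display can be checked in a couple of lines. This is only an illustrative sketch in Python 3; the variable names are made up for the example.

```python
# Caret notation shows a control character as '^' followed by the
# printable character 64 code points higher, so code point 22 is ^V.
syn = "\x16"
code_point = ord(syn)
caret = "^" + chr(code_point + 64)
print(code_point, caret)  # 22 ^V
```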

The answer from @michael put me on the right track. But some control characters below 32 are fine, such as the tab (9), the line feed (10), and the carriage return (13), and a few higher characters are still bad. So:

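# Get list of bad characters that would lead to XMLSyntaxError.
# Calculated manually like this (Python 2):
from lxml import etree
from StringIO import StringIO
BAD = []
for i in range(0, 10000):
    try:
        # Wrap the character in a root element (the name is arbitrary),
        # so that only the character itself can make the document invalid.
        etree.parse(StringIO('<p>%s</p>' % unichr(i)))
    except etree.XMLSyntaxError:
        BAD.append(i)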

This leads to a list of 31 characters that can be hardcoded instead of doing the above calculation in code:

BAD = [
    0, 1, 2, 3, 4, 5, 6, 7, 8,
    11, 12,
    14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
    # Two are perfectly valid characters but go wrong for different reasons,
    # so they are left out of the list:
    # 38 is '&' which gives: xmlParseEntityRef: no name.
    # 60 is '<' which gives: StartTag: invalid element name.
]
BAD_BASESTRING_CHARS = [chr(b) for b in BAD]
BAD_UNICODE_CHARS = [unichr(b) for b in BAD]

Then use it like this:

def remove_bad_chars(value):
    # Remove bad control characters.
    if isinstance(value, unicode):
        for char in BAD_UNICODE_CHARS:
            value = value.replace(char, u'')
    elif isinstance(value, basestring):
        for char in BAD_BASESTRING_CHARS:
            value = value.replace(char, '')
    return value
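
On Python 3, where unicode, basestring and unichr no longer exist, the same idea can be sketched with str.translate. This is not the code from the answer above, only a minimal port under assumptions: the names BAD_CODE_POINTS, DELETE_TABLE and remove_bad_chars_py3 are made up for illustration, and only str input is handled (bytes would need decoding first).

```python
# Hypothetical Python 3 port; all names here are illustrative.
# The 29 control characters from the BAD list above: every code point
# below 32 except tab (9), line feed (10) and carriage return (13).
BAD_CODE_POINTS = [i for i in range(32) if i not in (9, 10, 13)]

# str.translate deletes every code point that maps to None.
DELETE_TABLE = dict.fromkeys(BAD_CODE_POINTS)


def remove_bad_chars_py3(value):
    """Return value with the bad control characters removed."""
    if isinstance(value, str):
        return value.translate(DELETE_TABLE)
    return value


print(repr(remove_bad_chars_py3("a\x16b\tc & d")))  # 'ab\tc & d'
```

A single translate call scans the string once, instead of once per bad character as the loop of replace calls does.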

If value is 2 gigabytes you might need to do this in a more efficient way, but I am ignoring that here, although the question mentions it. In my case, I am the one creating the XML file, but I need to deal with these characters in the original data, so I will use this function before putting data in the XML.
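
For an input of that size, one possible approach is to filter the file in fixed-size chunks instead of loading it whole. The following is a minimal Python 3 sketch, not something from the answer above. It assumes the file uses an ASCII-compatible encoding such as UTF-8, where byte values below 32 only ever stand for the control characters themselves (this would not hold for UTF-16). The function name clean_stream and the chunk size are hypothetical.

```python
import io

# Byte values of the 29 bad control characters: everything below 32
# except tab (9), line feed (10) and carriage return (13).
BAD_BYTES = bytes(i for i in range(32) if i not in (9, 10, 13))


def clean_stream(src, dst, chunk_size=1024 * 1024):
    """Copy binary stream src to dst, dropping the bad control bytes."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # bytes.translate(None, delete) removes every byte listed in delete.
        dst.write(chunk.translate(None, BAD_BYTES))


# Small in-memory demonstration; real files would be opened in binary mode.
source = io.BytesIO(b"<root>a\x16b\tc</root>")
target = io.BytesIO()
clean_stream(source, target, chunk_size=4)
print(target.getvalue())  # b'<root>ab\tc</root>'
```

Because each bad character is a single byte, chunk boundaries cannot split one, so memory use stays at roughly one chunk. The cleaned file could then be handed to a streaming parser such as lxml's etree.iterparse.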
