I\'m trying to parse an XML file that\'s over 2GB with Python\'s lxml library. Unfortunately, the XML file does not have a line telling the character encoding, so I have to
I ran into this too, getting \x16 in data (the unicode 'synchronous idle' or 'SYN' character, displayed in the xml as ^V) which leads to an error when parsing the xml: XMLSyntaxError: PCDATA invalid Char value 22. The 22 is because because ord('\x16') is 22.
The answer from @michael put me on the right track. But some control characters below 32 are fine, like the return or the tab, and a few higher characters are still bad. So:
# Get list of bad characters that would lead to XMLSyntaxError.
# Calculated manually like this:
from lxml import etree
from StringIO import StringIO
BAD = []
for i in range(0, 10000):
try:
x = etree.parse(StringIO('%s
' % unichr(i)))
except etree.XMLSyntaxError:
BAD.append(i)
This leads to a list of 31 characters that can be hardcoded instead of doing the above calculation in code:
BAD = [
0, 1, 2, 3, 4, 5, 6, 7, 8,
11, 12,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
# Two are perfectly valid characters but go wrong for different reasons.
# 38 is '&' which gives: xmlParseEntityRef: no name.
# 60 is '<' which gives: StartTag: invalid element namea different error.
]
BAD_BASESTRING_CHARS = [chr(b) for b in BAD]
BAD_UNICODE_CHARS = [unichr(b) for b in BAD]
Then use it like this:
def remove_bad_chars(value):
# Remove bad control characters.
if isinstance(value, unicode):
for char in BAD_UNICODE_CHARS:
value = value.replace(char, u'')
elif isinstance(value, basestring):
for char in BAD_BASESTRING_CHARS:
value = value.replace(char, '')
return value
If value is 2 Gigabyte you might need to do this in a more efficient way, but I am ignoring that here, although the question mentions it. In my case, I am the one creating the xml file, but I need to deal with these characters in the original data, so I will use this function before putting data in the xml.