I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream
Since the problem is being caused by illegal XML characters, in this case the 0x19 byte, I decided to strip them off. I found the following regular expression on this site:
invalid_xml = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')
And I wrote this piece of code that removes illegal bytes while saving an xml feed:
conn = urllib2.urlopen(xmlfeed)
xmlfile = open('output', 'w')
while True:
data = conn.read(4096)
if data:
newdata, count = invalid_xml.subn('', data)
if count > 0 :
print 'Removed %s illegal characters from XML feed' % count
xmlfile.write(newdata)
else:
break
xmlfile.close()