I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream
I had a similar problem with char "" in my xml file, which is also invalid xmlchar. This is because in the xml version 1.0, the characters like , are not allowed. And the rule is that all character composition as regular expression '[0-1]?[0-9A-E]' are not allowed. My purpose it to correct the invalid char in a huge xml file, based on Rik's answer, I improved it as below :
import re
invalid_xml = re.compile(r'[0-1]?[0-9a-eA-E];')
new_file = open('new_file.xml','w')
with open('old_file.xml') as f:
for line in f:
nline, count = invalid_xml.subn('',line)
new_file.write(nline)
new_file.close()