Parsing huge, badly encoded XML files in Python

我寻月下人不归 2021-01-11 15:14

I have been working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream

4 answers
  •  长情又很酷
    2021-01-11 15:42

    Since the problem is caused by illegal XML characters, in this case the 0x19 byte, I decided to strip them out. I found the following regular expression on this site:

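    import re

    # C0 control characters other than tab, LF and CR, plus DEL (0x7F).
    # A bytes pattern, because the feed is read as raw bytes.
    invalid_xml = re.compile(rb'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')
    

    I then wrote this piece of code, which removes the illegal bytes while saving an XML feed:

    import urllib.request

    # xmlfeed is the URL of the XML feed (defined elsewhere)
    conn = urllib.request.urlopen(xmlfeed)
    xmlfile = open('output', 'wb')  # binary mode: the data is written unchanged
    
    while True:
        data = conn.read(4096)  # read the feed in 4 KiB chunks
        if data:
            newdata, count = invalid_xml.subn(b'', data)
            if count > 0:
                print('Removed %s illegal characters from XML feed' % count)
            xmlfile.write(newdata)
    
        else:
            break  # an empty read means end of stream
    
    xmlfile.close()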
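
The same filtering can also be applied on the fly, so the cleaned data goes straight to a streaming parser instead of an intermediate file. The following is only a minimal sketch under a few assumptions: `CleanedStream` and `iter_item_text` are hypothetical names introduced here, the parser is the standard library's `xml.etree.ElementTree.iterparse`, and the input uses an ASCII-compatible encoding such as UTF-8 (in UTF-16 these byte values occur inside ordinary characters, so byte-level stripping would corrupt the data).

```python
import io
import re
import xml.etree.ElementTree as ET

# Same byte pattern as in the answer above.
INVALID_XML = re.compile(rb'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')


class CleanedStream:
    """Hypothetical file-like wrapper that strips invalid bytes on read()."""

    def __init__(self, raw):
        self.raw = raw  # any binary file-like object with a read() method

    def read(self, size=-1):
        while True:
            data = self.raw.read(size)
            cleaned = INVALID_XML.sub(b'', data)
            # Return b'' only at the real end of input. A chunk made up
            # entirely of invalid bytes must not look like EOF to the parser.
            if cleaned or not data:
                return cleaned


def iter_item_text(raw, tag):
    """Yield the text of each <tag> element, parsing incrementally."""
    for _event, elem in ET.iterparse(CleanedStream(raw), events=('end',)):
        if elem.tag == tag:
            yield elem.text
            elem.clear()  # drop the element's contents to limit memory use


# Usage with an in-memory sample containing a stray 0x19 byte.
sample = io.BytesIO(b'<root><item>a\x19b</item><item>c</item></root>')
texts = list(iter_item_text(sample, 'item'))
```

For a real feed, the object returned by `urllib.request.urlopen` could be passed as `raw` in place of the `io.BytesIO` sample, since it also exposes `read()`.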
    
