Parsing huge, badly encoded XML files in Python

前端 未结 4 1393
我寻月下人不归
我寻月下人不归 2021-01-11 15:14

I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream

4条回答
  •  陌清茗
    陌清茗 (楼主)
    2021-01-11 15:46

    I had a similar problem with char "" in my xml file, which is also invalid xmlchar. This is because in the xml version 1.0, the characters like �,  are not allowed. And the rule is that all character composition as regular expression '&#x[0-1]?[0-9A-E]' are not allowed. My purpose it to correct the invalid char in a huge xml file, based on Rik's answer, I improved it as below :

    import re
    
    invalid_xml = re.compile(r'&#x[0-1]?[0-9a-eA-E];')
    
    new_file = open('new_file.xml','w') 
    with open('old_file.xml') as f:
        for line in f:
            nline, count = invalid_xml.subn('',line)
            new_file.write(nline) 
    new_file.close()
    

提交回复
热议问题