Can Python xml ElementTree parse a very large xml file?

误落风尘 2020-12-17 19:45

I'm trying to parse a large file (> 2 GB) of structured markup data, and there is not enough memory to load it. Which XML parsing class is best suited to this situation?

5 answers
  •  小蘑菇 (original poster)
    2020-12-17 20:35

    As the other answers point out, ElementTree is a DOM-style parser that builds the tree in memory, although it does provide an iterparse() function for incremental parsing.
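    For comparison, here is a minimal sketch of how iterparse() can be kept memory-friendly. It is not part of my solution; the helper name iter_attribs and the sample document are made up for illustration. The idea is to handle each element when its "end" event arrives and then clear the root so that already-processed elements can be freed.

```python
import io
import xml.etree.ElementTree as ET


def iter_attribs(source, tag):
    """Yield the attribute dict of every <tag> element, one at a time."""
    # Request "start" events as well, so the first event hands us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Copy the attributes before the element is discarded.
            yield dict(elem.attrib)
            # Drop everything parsed so far; otherwise the tree keeps growing.
            root.clear()


# Tiny in-memory stand-in for the large file (hypothetical content).
sample = io.BytesIO(b"<root><item id='1'/><item id='2'/><other/></root>")
ids = [attrib["id"] for attrib in iter_attribs(sample, "item")]
```

    Without the root.clear() call, iterparse() still ends up holding the complete tree, which is why it is usually not enough on its own for multi-gigabyte files.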

    To reduce the memory footprint I used a real SAX parser. Here is the link I used for my solution. Here's the official doc. Here's my XML:

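    <...>
        <entity storageTableName="..." tableName="...">
            <attribute storageFieldName="..." fieldName="..."/>
            <attribute storageFieldName="..." fieldName="..."/>
            . . .
        </entity>
        . . .
    </...>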

    Here's the code:

    import xml.sax
    
    
    class ModelNameHandler(xml.sax.ContentHandler):
        ENTITY_TAG = "entity"
        STORAGE_TABLE_NAME_ATTR = "storageTableName"
        TABLE_NAME_ATTR = "tableName"
        ATTRIBUTE_TAG = "attribute"
        STORAGE_FIELD_NAME_ATTR = "storageFieldName"
        FIELD_NAME_ATTR = "fieldName"
    
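        def __init__(self):
            super().__init__()  # let ContentHandler set up its own state
            self.entity_code = None  # storageTableName of the entity being read
            self.entity_names = {}   # storageTableName -> tableName
            self.attr_names = {}     # "storageTableName.storageFieldName" -> fieldName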
    
        def startElement(self, tag, attributes):
            if tag == self.ENTITY_TAG:
                self.entity_code = attributes[self.STORAGE_TABLE_NAME_ATTR]
                entity_name = attributes[self.TABLE_NAME_ATTR]
                self.entity_names[self.entity_code] = entity_name
            elif tag == self.ATTRIBUTE_TAG:
                attr_code = attributes[self.STORAGE_FIELD_NAME_ATTR]
                key = self.entity_code + "." + attr_code
                attr_name = attributes[self.FIELD_NAME_ATTR]
                self.attr_names[key] = attr_name
    
    
    def get_model_names(file):
        parser = xml.sax.make_parser()
        parser.setFeature(xml.sax.handler.feature_namespaces, 0)
        handler = ModelNameHandler()
        parser.setContentHandler(handler)
        parser.parse(file)
    
        return handler.entity_names, handler.attr_names
    

    It runs fast enough for my purposes.

    Just in case, here is how the function is called:

    import my_package as p
    
    
    if __name__ == "__main__":
    
        with open('/.xml', 'r', encoding='utf_8') as file:
            entity_names, attr_names = p.get_model_names(file)
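
    To see what the two dictionaries look like, the same idea can be tried on a small in-memory document. This is only a sketch: NameHandler is a compact stand-in for the ModelNameHandler above, and the root tag and every value in the sample are invented for illustration.

```python
import xml.sax


class NameHandler(xml.sax.ContentHandler):
    """Compact stand-in for the ModelNameHandler shown earlier."""

    def __init__(self):
        super().__init__()
        self.entity_code = None
        self.entity_names = {}
        self.attr_names = {}

    def startElement(self, tag, attributes):
        if tag == "entity":
            # Remember which entity the following attributes belong to.
            self.entity_code = attributes["storageTableName"]
            self.entity_names[self.entity_code] = attributes["tableName"]
        elif tag == "attribute":
            key = self.entity_code + "." + attributes["storageFieldName"]
            self.attr_names[key] = attributes["fieldName"]


# Hypothetical sample; the root tag and all values are invented.
SAMPLE = b"""<model>
  <entity storageTableName="t1" tableName="Customer">
    <attribute storageFieldName="f1" fieldName="Name"/>
    <attribute storageFieldName="f2" fieldName="Email"/>
  </entity>
</model>"""

handler = NameHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.entity_names)  # {'t1': 'Customer'}
print(handler.attr_names)    # {'t1.f1': 'Name', 't1.f2': 'Email'}
```

    Only the two dictionaries are kept in memory; the parser discards each element once startElement() returns, which is what keeps the footprint small on a very large file.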
    
