Can Python xml ElementTree parse a very large xml file?


误落风尘 2020-12-17 19:45

I'm trying to parse a large file (> 2GB) of structured markup data, and there is not enough memory for it. Which XML parsing class is the best choice in this situation? Mo

5 answers

小蘑菇 (original poster)

2020-12-17 20:35

As the other answerers said, ElementTree is a DOM-style parser that builds the whole tree in memory, although it does have an iterparse() method for incremental parsing.
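For comparison, here is a minimal sketch of the iterparse() route. The element and attribute names are borrowed from my handler further down, the helper name iter_entity_names is made up for this example, and the tiny in-memory document is hypothetical data standing in for the real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_entity_names(source, tag="entity"):
    """Yield (storageTableName, tableName) pairs one element at a time."""
    # Only "end" events are requested: an element is complete once its end tag is seen.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.get("storageTableName"), elem.get("tableName")
            # Drop the element's children, attributes and text so the
            # finished subtree does not stay in memory.
            elem.clear()


# Tiny in-memory document standing in for the large file (hypothetical data).
sample = io.BytesIO(
    b'<model>'
    b'<entity storageTableName="T1" tableName="Customer"/>'
    b'<entity storageTableName="T2" tableName="Order"/>'
    b'</model>'
)
entity_pairs = list(iter_entity_names(sample))
print(entity_pairs)
```

Note that the cleared elements are still attached to the root as empty shells, so memory use is reduced rather than constant; with a very large number of elements that may still add up.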

To reduce the memory footprint I used a real SAX parser. Here is the link I used for my solution. Here's the official doc. Here's my XML:

Here's the code:

import xml.sax


class ModelNameHandler(xml.sax.ContentHandler):
    ENTITY_TAG = "entity"
    STORAGE_TABLE_NAME_ATTR = "storageTableName"
    TABLE_NAME_ATTR = "tableName"
    ATTRIBUTE_TAG = "attribute"
    STORAGE_FIELD_NAME_ATTR = "storageFieldName"
    FIELD_NAME_ATTR = "fieldName"

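    def __init__(self):
        super().__init__()
        # Code of the entity currently being parsed; its attributes are keyed under it.
        self.entity_code = None
        self.entity_names = {}  # storageTableName -> tableName
        self.attr_names = {}  # "storageTableName.storageFieldName" -> fieldName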

    def startElement(self, tag, attributes):
        if tag == self.ENTITY_TAG:
            self.entity_code = attributes[self.STORAGE_TABLE_NAME_ATTR]
            entity_name = attributes[self.TABLE_NAME_ATTR]
            self.entity_names[self.entity_code] = entity_name
        elif tag == self.ATTRIBUTE_TAG:
            attr_code = attributes[self.STORAGE_FIELD_NAME_ATTR]
            key = self.entity_code + "." + attr_code
            attr_name = attributes[self.FIELD_NAME_ATTR]
            self.attr_names[key] = attr_name


def get_model_names(file):
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    handler = ModelNameHandler()
    parser.setContentHandler(handler)
    parser.parse(file)

    return handler.entity_names, handler.attr_names

It works fast enough.

Just in case, here are a few more details on how it is called:

import my_package as p


if __name__ == "__main__":

    with open('/.xml', 'r', encoding='utf_8') as file:
        entity_names, attr_names = p.get_model_names(file)
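To try the handler without the real file, something like the sketch below should work. The sample XML is hypothetical: its layout (a model root, entity elements with nested attribute elements) is only assumed from the attribute names the handler reads. NameHandler is a condensed stand-in for ModelNameHandler so that the snippet runs on its own.

```python
import xml.sax

# Hypothetical sample; the real file's layout is an assumption.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<model>
  <entity storageTableName="T_CUST" tableName="Customer">
    <attribute storageFieldName="F_NAME" fieldName="Name"/>
    <attribute storageFieldName="F_CITY" fieldName="City"/>
  </entity>
  <entity storageTableName="T_ORD" tableName="Order">
    <attribute storageFieldName="F_DATE" fieldName="Date"/>
  </entity>
</model>
"""


class NameHandler(xml.sax.ContentHandler):
    """Condensed stand-in for ModelNameHandler, with the same lookup logic."""

    def __init__(self):
        super().__init__()
        self.entity_code = None
        self.entity_names = {}
        self.attr_names = {}

    def startElement(self, tag, attributes):
        if tag == "entity":
            # Remember the current entity so nested attributes can be keyed under it.
            self.entity_code = attributes["storageTableName"]
            self.entity_names[self.entity_code] = attributes["tableName"]
        elif tag == "attribute":
            key = self.entity_code + "." + attributes["storageFieldName"]
            self.attr_names[key] = attributes["fieldName"]


handler = NameHandler()
# parseString pushes the bytes through the same SAX callbacks as parser.parse().
xml.sax.parseString(SAMPLE_XML, handler)
print(handler.entity_names)
print(handler.attr_names)
```

Only the two dictionaries grow while parsing, so memory use depends on the number of distinct names rather than on the size of the file.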
