I'm trying to parse a large file (> 2 GB) of structured markup data, and there is not enough memory for it. Which XML parsing class is best suited to this situation?
As the other answers point out, ElementTree is a DOM-style parser that builds the tree in memory, although it does have an iterparse() method.
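For comparison, here is a minimal sketch of what the iterparse() route could look like. It is not what I ended up using. The function name iter_entity_names and the sample document are made up for illustration, and the sketch assumes the entity elements are direct children of the root, so that clearing the root after each one actually frees the parsed subtree.

```python
import io
import xml.etree.ElementTree as ET


def iter_entity_names(source, entity_tag="entity"):
    """Yield (storageTableName, tableName) pairs while streaming the document."""
    # Request "start" events too, so the very first event hands us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == entity_tag:
            yield elem.get("storageTableName"), elem.get("tableName")
            # Drop the finished subtree. Assumption: entities sit directly
            # under the root; deeper nesting would need its parent cleared.
            root.clear()


# Hypothetical sample document, invented for this example only.
sample = io.BytesIO(
    b'<model>'
    b'<entity storageTableName="T1" tableName="Customer">'
    b'<attribute storageFieldName="F1" fieldName="Name"/>'
    b'</entity>'
    b'<entity storageTableName="T2" tableName="Order"/>'
    b'</model>'
)
names = dict(iter_entity_names(sample))
```

Without the clear() call the tree would still grow to the full document size, which is the reason this approach is easy to get wrong on multi-gigabyte files.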
To reduce the memory footprint I used a real SAX parser. Here is the link I used for my solution. Here's the official doc. Here's my XML:
. . .
. . .
Here's the code:
import xml.sax
import xml.sax.handler


class ModelNameHandler(xml.sax.ContentHandler):
    ENTITY_TAG = "entity"
    STORAGE_TABLE_NAME_ATTR = "storageTableName"
    TABLE_NAME_ATTR = "tableName"
    ATTRIBUTE_TAG = "attribute"
    STORAGE_FIELD_NAME_ATTR = "storageFieldName"
    FIELD_NAME_ATTR = "fieldName"

    def __init__(self):
        super().__init__()
        # Code of the entity whose attributes are currently being read.
        self.entity_code = None
        self.entity_names = {}
        self.attr_names = {}

    def startElement(self, tag, attributes):
        if tag == self.ENTITY_TAG:
            # Remember the entity so nested attributes can be keyed by it.
            self.entity_code = attributes[self.STORAGE_TABLE_NAME_ATTR]
            entity_name = attributes[self.TABLE_NAME_ATTR]
            self.entity_names[self.entity_code] = entity_name
        elif tag == self.ATTRIBUTE_TAG:
            attr_code = attributes[self.STORAGE_FIELD_NAME_ATTR]
            # Key format: "<entity code>.<attribute code>".
            key = self.entity_code + "." + attr_code
            attr_name = attributes[self.FIELD_NAME_ATTR]
            self.attr_names[key] = attr_name


def get_model_names(file):
    parser = xml.sax.make_parser()
    # Namespace processing is off, so startElement() gets plain tag names.
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    handler = ModelNameHandler()
    parser.setContentHandler(handler)
    parser.parse(file)
    return handler.entity_names, handler.attr_names
It works fast enough.
Just in case, here is how I call it:
import my_package as p

if __name__ == "__main__":
    with open('/.xml', 'r', encoding='utf_8') as file:
        entity_names, attr_names = p.get_model_names(file)