I have to parse a 1 GB XML file with a structure such as the one below and extract the text within the "Author" and "Content" tags:
I prefer XPath for such things:
In [1]: from lxml.etree import parse
In [2]: tree = parse('/tmp/database.xml')
In [3]: for post in tree.xpath('/Database/BlogPost'):
   ...:     print('Author:', post.xpath('Author')[0].text)
   ...:     print('Content:', post.xpath('Content')[0].text)
   ...:
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
I'm not sure how this compares when processing big files, though. Comments on that would be appreciated.
Doing it your way,
from lxml import etree

# tag="BlogPost" limits the events to BlogPost elements
for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for info in element.iter():
        if info.tag in ('Author', 'Content'):
            print(info.tag, ':', info.text)
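For a file this size, it helps to release each element once it has been read, because iterparse still builds the whole tree as it goes. Below is a minimal sketch of that idea. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The sample document and the iter_posts helper are hypothetical, and the layout (Database > BlogPost > Author/Content) is assumed from the XPath used above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real file's layout is assumed, not known.
SAMPLE = b"""<Database>
  <BlogPost><Author>Last Name, Name</Author><Content>Lorem ipsum</Content></BlogPost>
  <BlogPost><Author>Other, Person</Author><Content>Dolor sit amet</Content></BlogPost>
</Database>"""


def iter_posts(source):
    """Yield (author, content) for each BlogPost, clearing the tree as we go."""
    # Request "start" events too, so the first event hands us the root element.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, element in context:
        if event == "end" and element.tag == "BlogPost":
            # Read the text first, then hand it to the caller.
            author = element.findtext("Author")
            content = element.findtext("Content")
            yield author, content
            # Detach already-processed children from the root so the
            # in-memory tree stays small.
            root.clear()


posts = list(iter_posts(io.BytesIO(SAMPLE)))
for author, content in posts:
    print("Author:", author)
    print("Content:", content)
```

With lxml, the same effect should be achievable by keeping the tag="BlogPost" filter and calling element.clear() at the end of each loop iteration.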