Using lxml and iterparse() to parse a big (~1 GB) XML file

我寻月下人不归 2020-11-27 17:25

I have to parse a 1 GB XML file with a structure such as the one below and extract the text within the "Author" and "Content" tags:


    

        
3 answers
  •  没有蜡笔的小新
    2020-11-27 18:02

    I prefer XPath for such things:

    In [1]: from lxml.etree import parse
    
    In [2]: tree = parse('/tmp/database.xml')
    
    In [3]: for post in tree.xpath('/Database/BlogPost'):
       ...:     print('Author:', post.xpath('Author')[0].text)
       ...:     print('Content:', post.xpath('Content')[0].text)
       ...: 
    Author: Last Name, Name
    Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
    Author: Last Name, Name
    Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
    Author: Last Name, Name
    Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
    

    I'm not sure if it's different in terms of processing big files, though. Comments about this would be appreciated.

    Doing it your way, with iterparse():

    from lxml import etree

    for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
        for info in element.iter():
            if info.tag in ('Author', 'Content'):
                print(info.tag, ':', info.text)
        element.clear()  # release the processed element so memory stays bounded
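On the open question about big files: `parse()` builds the whole tree in memory before any XPath runs, while `iterparse()` can release each element as soon as it has been read. The following is a minimal sketch of that streaming pattern, assuming the `/Database/BlogPost` layout used in the XPath example above. The function name `iter_posts` and the in-memory `sample` document are hypothetical and exist only for illustration; a real run would pass the path to the large file instead.

```python
import io
from lxml import etree


def iter_posts(source, tag="BlogPost", fields=("Author", "Content")):
    """Yield one dict per matching element, releasing each one after use."""
    # tag= limits the reported 'end' events to elements with that tag.
    for _event, post in etree.iterparse(source, events=("end",), tag=tag):
        # findtext() returns the text of the first matching child, or None.
        yield {name: post.findtext(name) for name in fields}
        # Drop the element's children and text now that it has been read.
        post.clear()
        # Delete already-processed siblings that are still attached to the root.
        while post.getprevious() is not None:
            del post.getparent()[0]


# Tiny in-memory document standing in for the real file (illustrative data).
sample = io.BytesIO(
    b"<Database>"
    b"<BlogPost><Author>Last Name, Name</Author>"
    b"<Content>Lorem ipsum</Content></BlogPost>"
    b"<BlogPost><Author>Last Name, Name</Author>"
    b"<Content>Dolor sit</Content></BlogPost>"
    b"</Database>"
)

posts = list(iter_posts(sample))
for post in posts:
    print("Author:", post["Author"])
    print("Content:", post["Content"])
```

Calling `clear()` alone leaves an empty `BlogPost` element attached to the root for every post already seen; deleting the preceding siblings as well keeps the partial tree from growing with the size of the file.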
    
