Parsing a large .bz2 file (40 GB) with lxml iterparse in python. Error that does not appear with uncompressed file

时光怂恿深爱的人放手 提交于 2019-11-30 19:51:05

It turns out that the problem is with the compressed planet.osm file.

As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and the bz2 python module cannot read multistream files. However, the bz2 documentation indicates an alternative module that can read such files, bz2file. I used it and it works perfectly!

So the code should read:

from lxml import etree as et
from bz2file import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            (do something)


    ## Do some cleaning
    # Get rid of that element
    elem.clear()

    # Also eliminate now-empty references from the root node to node        
    while elem.getprevious() is not None:
        del elem.getparent()[0]

Also, doing some research on using the PBF format (as advised in the comments), I stumbled upon imposm.parser, a python module that implements a generic parser for OSM data (in pbf or xml format). You may want to have a look at this!

As an alternative you can use the output of bzcat command (which can handle multistream files too):

p = subprocess.Popen(["bzcat", "data.bz2"], stdout=subprocess.PIPE)
parser = et.iterparse(p.stdout, ...)
# at the end just check that p.returncode == 0 so there were no errors
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!