Parsing a large .bz2 file (40 GB) with lxml iterparse in Python: error that does not appear with the uncompressed file

孤城傲影 2020-12-18 00:49

I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41 GB, I don't want to decompress the file completely.

So I figured

2 answers
  •  旧时难觅i
    2020-12-18 00:57

    It turns out that the problem is with the compressed planet.osm file.

    As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and Python 2's bz2 module cannot read multistream files (the standard bz2 module only gained multistream support in Python 3.3). However, the bz2 documentation points to an alternative module that can read such files, bz2file. I used it and it works perfectly!

    So the code should read:

    from lxml import etree as et
    from bz2file import BZ2File

    path = "where/my/fileis.osm.bz2"
    with BZ2File(path) as xml_file:
        parser = et.iterparse(xml_file, events=('end',))
        for event, elem in parser:
            # Keep <tag> children alive until their parent element ends
            if elem.tag == "tag":
                continue
            if elem.tag == "node":
                pass  # (do something)

            ## Do some cleaning (inside the loop, once per element)
            # Get rid of that element
            elem.clear()

            # Also eliminate now-empty references from the root node to node
            while elem.getprevious() is not None:
                del elem.getparent()[0]
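The same stream-and-discard idea can be tried without installing anything. The sketch below is only an illustration, not the answer's code: it swaps lxml for the standard library's xml.etree.ElementTree (which has no getprevious/getparent, so it clears the root instead), assumes Python 3.3 or later, and uses a made-up in-memory sample (`SAMPLE`) and a hypothetical helper name (`count_nodes`) in place of the real planet file.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for planet.osm (assumption, for illustration only)
SAMPLE = b"""<osm>
  <node id="1" lat="0.0" lon="0.0"><tag k="name" v="a"/></node>
  <node id="2" lat="1.0" lon="1.0"/>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
</osm>"""


def count_nodes(fileobj):
    """Stream through an OSM XML file object and count <node> elements."""
    count = 0
    context = ET.iterparse(fileobj, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == "node":
            count += 1
        # When a top-level element is complete, release everything
        # that has been attached to the root so far.
        if elem.tag in ("node", "way", "relation"):
            root.clear()
    return count


# Compress the sample in memory and read it back through BZ2File
compressed = io.BytesIO(bz2.compress(SAMPLE))
with bz2.BZ2File(compressed) as xml_file:
    n = count_nodes(xml_file)
print(n)
```

On a real file, `bz2.BZ2File(path)` would replace the in-memory buffer; whether the standard-library parser is fast enough for a 41 GB file is a separate question that this sketch does not address.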
    

    Also, while researching the PBF format (as advised in the comments), I stumbled upon imposm.parser, a Python module that implements a generic parser for OSM data (in PBF or XML format). You may want to have a look at it!
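To see what "multistream" means in practice, here is a minimal sketch assuming Python 3.3 or later and only the standard library. The two byte strings are made-up sample data. A single-stream decompressor stops after the first stream, which is the kind of behaviour that trips up a reader without multistream support, while `bz2.decompress` walks through every stream.

```python
import bz2

# Two independently compressed streams, concatenated into one payload.
# This mimics the layout of a multistream .bz2 file (sample data is made up).
first = bz2.compress(b"<osm>")
second = bz2.compress(b"</osm>")
multistream = first + second

# A single BZ2Decompressor stops at the end of the first stream;
# the bytes of the following stream are left in `unused_data`.
single = bz2.BZ2Decompressor()
head = single.decompress(multistream)
leftover = single.unused_data

# bz2.decompress() (Python 3.3+) decompresses every stream in the payload.
whole = bz2.decompress(multistream)

print(head)
print(whole)
```

If `leftover` is non-empty after the first stream ends, the file contains more than one stream and needs a multistream-aware reader.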
