I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41 GB, I don't want to decompress the file completely.
So I figured
It turns out that the problem is with the compressed planet.osm file.
As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and the standard bz2 Python module could not read multistream files at the time (support was only added in Python 3.3). However, the bz2 documentation points to an alternative module that can read such files, bz2file. I used it and it works perfectly!
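To make the multistream issue concrete, here is a minimal sketch, assuming Python 3.3 or later and using only the standard library. The two tiny payloads are made up for illustration; a real planet file consists of many such streams concatenated in the same way.

```python
import bz2

# Two independently compressed chunks, concatenated: a "multistream" file.
first = bz2.compress(b"<osm>")
second = bz2.compress(b"</osm>")
multistream = first + second

# A single BZ2Decompressor handles exactly one stream and then stops.
single = bz2.BZ2Decompressor()
head = single.decompress(multistream)
leftover = single.unused_data  # the second stream, still compressed

# bz2.decompress (Python 3.3+) keeps going across stream boundaries.
everything = bz2.decompress(multistream)

print(head)
print(everything)
```

A reader that stops after the first stream sees only a truncated XML document, which is why the parser fails partway through. bz2file brings the multistream-aware behavior to older Python versions.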
So the code should read:
from lxml import etree as et
from bz2file import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for event, elem in parser:
        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            pass  # (do something)

        ## Do some cleaning
        # Get rid of that element
        elem.clear()
        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]
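The same stream-and-clean-up idea can be tried on a small in-memory sample. The following is only a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree and bz2.open (Python 3.3+) instead of lxml and bz2file, so it runs without extra dependencies, and it frees memory with root.clear() because getprevious() and getparent() are lxml-only. The names count_nodes, TOP_LEVEL and SAMPLE are hypothetical.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Direct children of <osm> that we discard once they are fully parsed.
TOP_LEVEL = {"node", "way", "relation"}

def count_nodes(xml_file):
    """Count <node> elements without keeping the whole tree in memory."""
    count = 0
    context = ET.iterparse(xml_file, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag not in TOP_LEVEL:
            continue
        if elem.tag == "node":
            count += 1  # replace with the real per-node work
        root.clear()  # drop everything parsed so far from the root
    return count

SAMPLE = b"""<osm>
  <node id="1" lat="0.0" lon="0.0"><tag k="name" v="a"/></node>
  <node id="2" lat="1.0" lon="1.0"/>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
</osm>"""

compressed = io.BytesIO(bz2.compress(SAMPLE))
with bz2.open(compressed) as xml_file:
    total = count_nodes(xml_file)
print(total)
```

Clearing only at the end of a top-level element keeps child elements such as tag and nd available while their parent is being processed.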
Also, while researching the PBF format (as advised in the comments), I stumbled upon imposm.parser, a Python module that implements a generic parser for OSM data (in PBF or XML format). You may want to have a look at it!
As an alternative, you can use the output of the bzcat command (which can handle multistream files too):
import subprocess
from lxml import etree as et

p = subprocess.Popen(["bzcat", "data.bz2"], stdout=subprocess.PIPE)
parser = et.iterparse(p.stdout, ...)
# at the end, call p.wait() and check that p.returncode == 0, so there were no errors
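A slightly fuller sketch of the subprocess approach is shown below, with the exit-status check spelled out. It rests on some assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, the helper name count_tags_from_command is hypothetical, and a small Python one-liner stands in for bzcat so that the example runs on machines without bzcat or a data file.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

def count_tags_from_command(cmd, tag):
    """Run cmd, stream-parse its stdout as XML, and count `tag` elements."""
    count = 0
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        for _, elem in ET.iterparse(proc.stdout, events=("end",)):
            if elem.tag == tag:
                count += 1
            elem.clear()  # minimal cleanup; empty elements stay attached
    # Leaving the with-block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise RuntimeError(f"{cmd[0]} exited with status {proc.returncode}")
    return count

# Stand-in for ["bzcat", "data.bz2"] so the sketch runs anywhere.
fake_bzcat = [sys.executable, "-c",
              "print('<osm><node id=\"1\"/><node id=\"2\"/></osm>')"]
n = count_tags_from_command(fake_bzcat, "node")
print(n)
```

With a real file, the command list would be ["bzcat", "data.bz2"] as in the snippet above. Note that returncode is None until the process has been waited for, which is why the check happens after the with-block.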