How to obtain random access of a gzip compressed file

Submitted by 不问归期 on 2020-05-27 06:25:26

Question


According to this FAQ on zlib.net it is possible to:

access data randomly in a compressed stream

I know about the module Bio.bgzf of Biopython 1.60, which:

supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. This uses Python’s zlib library internally, and provides a simple interface like Python’s gzip library.

But for my use case I don't want to use that format. Basically, I want something that emulates the code below:

import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz','rt') as f:
    f.seek(large_integer_new_line_start)

but with the efficiency that zlib itself offers for random access to a compressed stream. How can I leverage that random access capability from Python?


Answer 1:


I gave up on doing random access on a gzipped file using Python. Instead, I converted my gzipped file to a block-gzipped (BGZF) file with bgzip, a block compression/decompression utility, on the command line:

zcat large_file.gz | bgzip > large_file.bgz

Then I used Biopython's Bio.bgzf module and its tell() method to get the virtual_offset of the line that follows the first million lines of the bgzipped file. After that, I could seek rapidly to that virtual_offset:

from Bio import bgzf

path = 'large_file.bgz'

# First pass: skip one million lines, then record where the next line starts.
handle = bgzf.BgzfReader(path)
for _ in range(10**6):
    handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()

# Later: jump straight to the recorded virtual offset.
handle = bgzf.BgzfReader(path)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

assert line1 == line2
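The value returned by tell() on a BGZF handle is not a plain byte position. A BGZF virtual offset packs two numbers into one integer: the file offset at which a compressed block starts (upper bits) and the position inside that block's decompressed data (lower 16 bits). The following is a minimal pure-Python sketch of that packing. The helper names are chosen for illustration and the code does not depend on Biopython.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a compressed block start and an in-block offset into one integer."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within_block must fit in 16 bits")
    if block_start < 0:
        raise ValueError("block_start must be non-negative")
    # Upper bits: where the block begins in the compressed file.
    # Lower 16 bits: position inside the decompressed block.
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Inverse of make_virtual_offset: return (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

This is why seeking is cheap: the reader jumps to block_start in the compressed file, decompresses that single block, and skips within_block bytes.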

I would also like to point to Mark Adler's Stack Overflow answer about examples/zran.c in the zlib distribution, which builds an index for random access into an ordinary gzip stream.
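The index idea can be approximated in pure Python for an ordinary .gz file. The sketch below is not zran.c and does not reproduce its on-disk index. It swaps in a simpler technique: it keeps in-memory copies of a zlib decompressor (via Decompress.copy()) at intervals, so a later read can resume from the nearest checkpoint instead of from the start. The function names, the CHUNK size and the span parameter are assumptions made for illustration, and the code assumes a single-member gzip file.

```python
import zlib

CHUNK = 64 * 1024  # compressed bytes fed to the decompressor per step


def build_checkpoints(path, span=1 << 20):
    """Scan the file once; return (uncompressed_pos, compressed_pos, decompressor) tuples.

    A checkpoint is taken roughly every `span` uncompressed bytes,
    always on a CHUNK boundary of the compressed input.
    """
    checkpoints = []
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # | 16: expect a gzip header
    out_pos = 0
    next_mark = 0
    with open(path, 'rb') as f:
        while not d.eof:
            if out_pos >= next_mark:
                # Snapshot the decompressor state at this input position.
                checkpoints.append((out_pos, f.tell(), d.copy()))
                next_mark = out_pos + span
            chunk = f.read(CHUNK)
            if not chunk:
                break
            out_pos += len(d.decompress(chunk))
    return checkpoints


def read_at(path, checkpoints, offset, size):
    """Return `size` uncompressed bytes starting at uncompressed `offset`."""
    # Last checkpoint at or before the requested offset (one at 0 always exists).
    out_pos, in_pos, saved = max(
        (c for c in checkpoints if c[0] <= offset), key=lambda c: c[0]
    )
    d = saved.copy()  # copy so the stored checkpoint stays reusable
    skip = offset - out_pos
    buf = bytearray()
    with open(path, 'rb') as f:
        f.seek(in_pos)
        while len(buf) < skip + size and not d.eof:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            buf += d.decompress(chunk)
    return bytes(buf[skip:skip + size])
```

A typical use would be to call build_checkpoints('large_file.gz') once and then read_at('large_file.gz', checkpoints, 10**9, 100) as often as needed. Each checkpoint holds a full decompressor state in memory, so a smaller span trades memory for faster seeks.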




Answer 2:


You are looking for dictzip.py, part of the serpento package. However, you have to compress the files with dictzip, a randomly seekable, backward-compatible variant of gzip compression.




Answer 3:


The indexed_gzip package might be what you want. It also uses zran.c under the hood.




Answer 4:


If you just want to access the file from a random point, can't you just do:

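import os
from random import randint

# Note: this seeks within the raw (compressed) bytes of the file.
with open(filename, 'rb') as f:
    size = f.seek(0, os.SEEK_END)   # seek() returns the new position, i.e. the file size
    f.seek(randint(0, size))        # absolute seek to a random byte position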


Source: https://stackoverflow.com/questions/22950030/how-to-obtain-random-access-of-a-gzip-compressed-file
