This is similar to a previous question, but the answers there don't satisfy my needs and my question is slightly different:
I am the author of an open-source tool for compressing a particular type of biological data. This tool, called starch, splits the data by chromosome and uses those divisions as indices for fast access to compressed data units within the larger archive.
Per-chromosome data are transformed to remove redundancy in genomic coordinates, and the transformed data are compressed with either the bzip2 or gzip algorithm. The offsets, metadata, and compressed genomic data are concatenated into one file.
Source code is available from our GitHub site. We have compiled it under Linux and Mac OS X.
For your case, you could store the offsets (to 10 MB chunks, or whatever granularity you choose) in a header to a custom archive format. To read a given chunk, you parse the header, retrieve its offset, and fseek through the file to current_offset_sum + header_size.
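As a minimal sketch of that header-plus-offsets idea in Python (the archive layout, the JSON offset table, and the function names here are illustrative assumptions for this answer, not the actual starch format):

```python
import gzip
import json
import struct

def write_archive(path, chunks):
    """Write a toy archive: an 8-byte header length, a JSON offset
    table, then the gzip-compressed chunks concatenated back to back.
    `chunks` maps a key (e.g. a chromosome name) to raw bytes."""
    blobs = {}
    offsets = {}
    pos = 0
    for key, data in chunks.items():
        blob = gzip.compress(data)
        # Offsets are relative to the end of the header.
        offsets[key] = {"offset": pos, "length": len(blob)}
        blobs[key] = blob
        pos += len(blob)
    header = json.dumps(offsets).encode()
    with open(path, "wb") as f:
        f.write(struct.pack(">Q", len(header)))  # header size, big-endian
        f.write(header)
        for blob in blobs.values():
            f.write(blob)

def read_chunk(path, key):
    """Parse the header, then seek directly to one compressed chunk
    without touching the rest of the file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack(">Q", f.read(8))
        offsets = json.loads(f.read(header_len))
        entry = offsets[key]
        # Total header size = 8-byte length prefix + JSON table.
        f.seek(8 + header_len + entry["offset"])
        return gzip.decompress(f.read(entry["length"]))
```

The point is that decompression cost is paid only for the chunk you ask for; everything before it is skipped with a single seek.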