Compression formats with good support for random access within archives?

滥情空心 · 2020-11-27 11:45

This is similar to a previous question, but the answers there don't satisfy my needs, and my question is slightly different:

I currently use gzip compression for som

13 answers
  •  时光取名叫无心 · 2020-11-27 12:08

    I am the author of an open-source tool for compressing a particular type of biological data. This tool, called starch, splits the data by chromosome and uses those divisions as indices for fast access to compressed data units within the larger archive.

    Per-chromosome data are transformed to remove redundancy in genomic coordinates, and the transformed data are compressed with either bzip2 or gzip algorithms. The offsets, metadata and compressed genomic data are concatenated into one file.
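    The transform-then-compress idea above can be sketched in Python. This is an illustration of the general technique, not starch's actual on-disk format: the record layout, the tab-separated encoding, and the `delta_encode`/`delta_decode` helpers are all hypothetical. The point is that delta-encoding sorted genomic start coordinates turns large absolute numbers into small, repetitive ones, which the standard bz2 (or gzip/zlib) codecs then compress well.

    ```python
    import bz2

    # Hypothetical records: (chromosome, start, end) intervals, already
    # grouped by chromosome and sorted by start coordinate.
    records = [("chr1", 1000, 1050), ("chr1", 1100, 1150), ("chr1", 1200, 1250)]

    def delta_encode(rows):
        """Store each start as a delta from the previous start, and each
        end as a length, removing redundancy in the coordinates."""
        prev = 0
        lines = []
        for chrom, start, end in rows:
            lines.append(f"{start - prev}\t{end - start}")
            prev = start
        return "\n".join(lines).encode()

    def delta_decode(blob, chrom="chr1"):
        """Invert delta_encode, rebuilding absolute coordinates."""
        prev = 0
        rows = []
        for line in blob.decode().splitlines():
            dstart, length = map(int, line.split("\t"))
            start = prev + dstart
            rows.append((chrom, start, start + length))
            prev = start
        return rows

    transformed = delta_encode(records)
    compressed = bz2.compress(transformed)   # gzip via the zlib/gzip modules works too
    restored = delta_decode(bz2.decompress(compressed))
    assert restored == records
    ```

    On real data the deltas are overwhelmingly small integers, so the transformed stream compresses noticeably better than the raw coordinates would.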

    Source code is available from our GitHub site. We have compiled it under Linux and Mac OS X.

    For your case, you could store block offsets (for 10 MB units, or whatever size suits you) in a header to a custom archive format. To read a given block, you parse the header, retrieve the offsets, and fseek to current_offset_sum + header_size within the file.
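    A minimal sketch of that header-plus-offsets layout, assuming a simple little-endian header (a chunk count followed by each chunk's compressed size) and zlib per chunk; the layout, chunk size, and function names are all illustrative, not a real format:

    ```python
    import io
    import struct
    import zlib

    CHUNK = 10 * 1024 * 1024  # compress the input in ~10 MB units

    def build_archive(data: bytes, chunk: int = CHUNK) -> bytes:
        """Compress data in fixed-size units and prepend a header listing
        the compressed size of each unit."""
        chunks = [zlib.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]
        header = struct.pack("<I", len(chunks))
        header += b"".join(struct.pack("<I", len(c)) for c in chunks)
        return header + b"".join(chunks)

    def read_chunk(archive: bytes, index: int) -> bytes:
        """Decompress only chunk `index`: parse the header, sum the sizes
        of the preceding chunks, and seek straight to the target."""
        f = io.BytesIO(archive)
        (count,) = struct.unpack("<I", f.read(4))
        sizes = struct.unpack(f"<{count}I", f.read(4 * count))
        header_size = 4 + 4 * count
        f.seek(header_size + sum(sizes[:index]))  # current_offset_sum + header_size
        return zlib.decompress(f.read(sizes[index]))

    data = b"ACGT" * 500_000                  # ~2 MB of sample data
    arc = build_archive(data, chunk=512 * 1024)
    assert read_chunk(arc, 2) == data[2 * 512 * 1024 : 3 * 512 * 1024]
    ```

    Only the requested chunk is ever decompressed, which is the essence of random access within a compressed archive; a real format would also carry metadata (checksums, per-chunk keys such as chromosome names) alongside the sizes.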
