random seek in 7z single file archive


Question


Is it possible to do random access (a lot of seeks) into a very large file compressed by 7zip?

The original file is huge (999 GB of XML) and I can't store it unpacked (I don't have that much free space). So, if the 7z format allows accessing a middle block without decompressing all the blocks before it, I can build an index of block beginnings and the corresponding offsets in the original file.

The header of my 7z archive is:

37 7A BC AF 27 1C 00 02 28 99 F1 9D 4A 46 D7 EA  // 7z signature; format version 0.2; start header CRC; next header offset
00 00 00 00 44 00 00 00 00 00 00 00 F4 56 CF 92  // next header offset (cont.); next header size = 0x44; next header CRC
00 1E 1B 48 A6 5B 0A 5A 5D DF 57 D8 58 1E E1 5F
71 BB C0 2D BD BF 5A 7C A2 B1 C7 AA B8 D0 F5 26
FD 09 33 6C 05 1E DF 71 C6 C5 BD C0 04 3A B6 29
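
For reference, those annotations correspond to the fixed 32-byte 7z signature header (6-byte signature, 2-byte format version, start-header CRC, then next-header offset, size and CRC, all little-endian). A minimal Python sketch that decodes the first two lines of the dump; the function name is just for illustration:

import struct

def parse_7z_signature_header(raw):
    # Fixed 32-byte signature header at the start of every .7z file.
    assert raw[:6] == bytes.fromhex("377ABCAF271C"), "not a 7z archive"
    major, minor = raw[6], raw[7]                      # format version, e.g. 0.2
    start_header_crc, = struct.unpack_from("<I", raw, 8)
    next_offset, next_size, next_crc = struct.unpack_from("<QQI", raw, 12)
    return {
        "version": f"{major}.{minor}",
        "start_header_crc": start_header_crc,
        "next_header_offset": next_offset,  # counted from the end of this 32-byte header
        "next_header_size": next_size,      # 0x44 in the dump above
        "next_header_crc": next_crc,
    }

raw = bytes.fromhex("377ABCAF271C00022899F19D4A46D7EA"
                    "000000004400000000000000F456CF92")
print(parse_7z_signature_header(raw))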

UPDATE: the 7z archiver reports that this file contains a single block of data, compressed with the LZMA algorithm. Decompression speed in testing is 600 MB/s (of unpacked data), using only one CPU core.


Answer 1:


It's technically possible, but if your question is "does the currently available binary 7zip command-line tool allow that", the answer is unfortunately no. The best it offers is to compress each file in the archive independently, allowing the files to be retrieved directly. But since what you want to compress is a single (huge) file, this trick will not work.

I'm afraid the only way is to chunk your file into small blocks and feed them to an LZMA encoder (included in the LZMA SDK). Unfortunately, that requires some programming skills.
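
For illustration only, here is a minimal Python sketch of that block-by-block idea using the standard lzma module rather than the LZMA SDK; the 4 MiB block size and the plain-text index format are arbitrary choices, not anything mandated by 7z:

import lzma

BLOCK_SIZE = 4 * 1024 * 1024  # uncompressed bytes per block (arbitrary)

def compress_blocks(src_path, dst_path, index_path):
    # Compress src_path as independent LZMA (xz) blocks and record, per block,
    # the uncompressed offset, compressed offset and compressed size.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst, \
         open(index_path, "w") as idx:
        uncomp_off = comp_off = 0
        while True:
            chunk = src.read(BLOCK_SIZE)
            if not chunk:
                break
            blob = lzma.compress(chunk)      # each block decompresses on its own
            dst.write(blob)
            idx.write(f"{uncomp_off}\t{comp_off}\t{len(blob)}\n")
            uncomp_off += len(chunk)
            comp_off += len(blob)

def read_range(dst_path, index_path, offset, size):
    # Return `size` uncompressed bytes starting at `offset`, touching only
    # the blocks that cover that range.
    index = [tuple(map(int, line.split())) for line in open(index_path)]
    out = bytearray()
    with open(dst_path, "rb") as dst:
        for uncomp_off, comp_off, comp_size in index:
            if uncomp_off + BLOCK_SIZE <= offset or uncomp_off >= offset + size:
                continue                     # block lies entirely outside the range
            dst.seek(comp_off)
            block = lzma.decompress(dst.read(comp_size))
            lo = max(0, offset - uncomp_off)
            hi = min(len(block), offset + size - uncomp_off)
            out += block[lo:hi]
    return bytes(out)

Compressing blocks independently costs some compression ratio compared to one solid stream; that loss is the price of random access.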

Note: a technically inferior but trivial compression scheme can be found at http://code.google.com/p/lz4/source/browse/trunk/lz4demo.c. Its main program does just what you are looking for: it cuts the source file into small blocks and feeds them one by one to a compressor (in this case, LZ4). The decoder then does the reverse operation; it can easily skip over compressed blocks and go straight to the one you want to retrieve.




Answer 2:


How about this:

Concept: because you are basically reading only one file, index the .7z by block.

Read the compressed file block by block, give each block a number and possibly an offset into the large file. Scan the data stream for target item anchors (e.g. Wikipedia article titles). For each anchor, record the block number where the item begins (which may be in the previous block).

Write the index to some kind of O(log n) store. For an access, retrieve the block number and its offset, extract that block and find the item. The cost is bounded by the extraction of one block (or very few) plus the string search within that block; a sketch of such a lookup follows these steps.

For this you have to read through the file once, but you can stream it and discard the data after processing, so nothing hits the disk.
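
A minimal sketch of such an anchor index, assuming blocks come from something like the block scheme in Answer 1 and that Wikipedia <title> tags serve as anchors (both are assumptions; anchors split across block boundaries are ignored here for brevity):

import bisect
import re

ANCHOR = re.compile(rb"<title>(.*?)</title>")  # hypothetical anchor pattern

def build_anchor_index(blocks):
    # blocks: iterable of (block_number, uncompressed_bytes) in stream order.
    # Returns a sorted list of (anchor_text, block_number) pairs, i.e. an
    # O(log n) lookup structure.
    entries = []
    for block_no, data in blocks:
        for match in ANCHOR.finditer(data):
            title = match.group(1).decode("utf-8", "replace")
            entries.append((title, block_no))
    entries.sort()
    return entries

def find_block(index, title):
    # Binary search; returns the block number that holds `title`, or None.
    pos = bisect.bisect_left(index, (title,))
    if pos < len(index) and index[pos][0] == title:
        return index[pos][1]
    return None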

DARN: you basically postulated this in your question... it seems advantageous to read the question before answering...




Answer 3:


The 7z archiver says that this file contains a single block of data, compressed with the LZMA algorithm.

What was the 7z / xz command used to find out whether it is a single compressed block or not? Will 7z create a multiblock (multistream) archive when used with several threads?

The original file is huge (999 GB of XML)

The good news: Wikipedia switched to multistream archives for its dumps (at least for enwiki): http://dumps.wikimedia.org/enwiki/

For example, the most recent dump, http://dumps.wikimedia.org/enwiki/20140502/, has a multistream bzip2 dump (with a separate index in "offset:export_article_id:article_name" form), and the 7z dump is stored in many sub-GB archives with roughly 3k (?) articles per archive:

Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream

enwiki-20140502-pages-articles-multistream.xml.bz2 10.8 GB
enwiki-20140502-pages-articles-multistream-index.txt.bz2 150.3 MB

All pages with complete edit history (.7z)

enwiki-20140502-pages-meta-history1.xml-p000000010p000003263.7z 213.3 MB
enwiki-20140502-pages-meta-history1.xml-p000003264p000005405.7z 194.5 MB
enwiki-20140502-pages-meta-history1.xml-p000005406p000008209.7z 216.1 MB
enwiki-20140502-pages-meta-history1.xml-p000008210p000010000.7z 158.3 MB
enwiki-20140502-pages-meta-history2.xml-p000010001p000012717.7z 211.7 MB
 .....
enwiki-20140502-pages-meta-history27.xml-p041211418p042648840.7z 808.6 MB

I think we can use the bzip2 index to estimate the article id even for the 7z dumps, and then we just need the 7z archive covering the right range (...p<first_id>p<last_id>.7z), as sketched below. stub-meta-history.xml may help too.
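
A minimal sketch of that lookup, assuming the multistream index has already been decompressed to a plain text file with "offset:article_id:article_name" lines and that the 7z file names follow the pFIRSTpLAST pattern shown above (the helper names are hypothetical):

import re

def article_id_for_title(index_path, title):
    # Scan the decompressed multistream index and return the article id
    # for a given title, or None if it is not listed.
    with open(index_path, encoding="utf-8") as idx:
        for line in idx:
            offset, art_id, name = line.rstrip("\n").split(":", 2)
            if name == title:
                return int(art_id)
    return None

RANGE = re.compile(r"p(\d+)p(\d+)\.7z$")   # e.g. ...xml-p000003264p000005405.7z

def archive_for_id(archive_names, art_id):
    # Pick the .7z archive whose pFIRSTpLAST range covers art_id.
    for name in archive_names:
        m = RANGE.search(name)
        if m and int(m.group(1)) <= art_id <= int(m.group(2)):
            return name
    return None

For example, with the file list above, archive_for_id(names, 5000) would pick enwiki-20140502-pages-meta-history1.xml-p000003264p000005405.7z.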

FAQ for dumps: http://meta.wikimedia.org/wiki/Data_dumps/FAQ




Answer 4:


Simply use:

7z e myfile_xml.7z -so | sed [something] 

For example, to get line 7:

7z e myfile_xml.7z -so | sed -n 7p



Source: https://stackoverflow.com/questions/7882337/random-seek-in-7z-single-file-archive
