Question
I'm building an index that is just several sets of ordered 32-bit integers stored contiguously in a binary file. The problem is that this file grows pretty large. I've been thinking of adding some compression scheme, but that's a bit outside my expertise. So I'm wondering: what compression algorithm would work best in this case? Also, decompression has to be fast, since this index will be used for lookups.
Answer 1:
If you are storing integers that are close together (e.g. 1, 3, 4, 5, 9, 10, ...) rather than random 32-bit integers (982346..., 3487623412.., etc.), you can do one thing:
Find the differences between adjacent numbers, which would be 2, 1, 1, 4, 1, ... in our example, and then Huffman encode these differences.
I don't think Huffman encoding will work well if you apply it directly to the original list of numbers.
But if you have a sorted list of nearby numbers, the odds are good that you will get a very good compression ratio by Huffman encoding the differences, maybe a better ratio than with the DEFLATE algorithm (LZ77 plus Huffman coding) used in the Zip libraries.
Anyway thanks for posting this interesting question.
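As a rough illustration of the difference-then-Huffman idea, here is a minimal Python sketch. The function names `deltas` and `huffman_code` are made up for this example, and the code only builds the prefix code and a bit string; it does not write a packed binary file or store the code table, which a real index would also need.

```python
import heapq
from collections import Counter
from itertools import count


def deltas(sorted_values):
    """Return the first value followed by the gaps between neighbours."""
    return [sorted_values[0]] + [
        b - a for a, b in zip(sorted_values, sorted_values[1:])
    ]


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(freq)): "0"}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {s: ""}) for s, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prefix the two cheapest subtrees with 0 and 1 and merge them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


values = [1, 3, 4, 5, 9, 10]
gaps = deltas(values)[1:]  # [2, 1, 1, 4, 1]
code = huffman_code(gaps)
encoded = "".join(code[g] for g in gaps)
print(code, encoded, len(encoded), "bits")
```

For the example list the five gaps take 7 bits in total, because the most common gap (1) gets a one-bit code. The first value would still have to be stored separately.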
Answer 2:
Are the integers grouped in a dense way or a sparse way?
By dense I'm referring to:
[1, 2, 3, 4, 42, 43, 78, 79, 80, 81]
By sparse I'm referring to:
[1, 4, 7, 9, 19, 42, 53, 55, 78, 80]
If the integers are grouped in a dense way you could compress the first vector to hold three ranges:
[(1, 4), (42, 43), (78, 81)]
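That is a 40% reduction (six stored integers instead of ten). Of course, this scheme does not work well on sparse data: in the sparse example no two values are consecutive, so every value becomes its own range and the compressed data takes up 100% more space than the original.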
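The range idea can be sketched in a few lines of Python. The helper names `to_ranges` and `from_ranges` are hypothetical, and the sketch assumes the input is sorted and free of duplicates.

```python
def to_ranges(sorted_values):
    """Collapse runs of consecutive integers into inclusive (start, end) pairs."""
    ranges = []
    for v in sorted_values:
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)  # extend the current run
        else:
            ranges.append((v, v))  # start a new run
    return ranges


def from_ranges(ranges):
    """Expand inclusive (start, end) pairs back into the original list."""
    return [v for start, end in ranges for v in range(start, end + 1)]


dense = [1, 2, 3, 4, 42, 43, 78, 79, 80, 81]
sparse = [1, 4, 7, 9, 19, 42, 53, 55, 78, 80]

print(to_ranges(dense))        # [(1, 4), (42, 43), (78, 81)]
print(len(to_ranges(sparse)))  # 10 ranges, i.e. 20 integers for 10 values
```

Running both lists through the helper shows the trade-off directly: the dense list shrinks from ten integers to six, while the sparse list doubles in size.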
Answer 3:
As you've discovered, a sorted sequence of N 32-bit integers doesn't contain 32*N bits of information. This is no surprise. Assuming no duplicates, for every sorted sequence there are N! unsorted sequences containing the same integers, so sorting discards about log2(N!) bits of ordering information.
Now, how do you take advantage of the limited information in the sorted sequence? Many compression algorithms base their compression on the use of shorter bitstrings for common input values (Huffman uses only this trick). Several posters have already suggested calculating the differences between numbers, and compressing those differences. They assume it will be a series of small numbers, many of which will be identical. In that case, the difference sequence will be compressed well by most algorithms.
However, take the Fibonacci sequence. That's definitely sorted integers. The difference between F(n) and F(n+1) is F(n-1). Hence, compressing the sequence of differences is equivalent to compressing the sequence itself - it doesn't help at all!
So, what we really need is a statistical model of your input data. Given the sequence N[0]...N[x], what is the probability distribution of N[x+1]? We know that P(N[x+1] < N[x]) = 0, as the sequence is sorted. The differential/Huffman-based solutions presented work because they assume P(N[x+1] - N[x] = d) is quite high for small positive d and independent of x, so they can use a few bits for the small differences. If you can give another model, you can optimize for that.
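One way to test the "small, repetitive gaps" model against real data is to measure the empirical entropy of the gaps. The sketch below is only an estimate under the assumption stated above (gaps are independent and identically distributed); the function name `gap_entropy_bits` is invented for this example.

```python
import math
from collections import Counter


def gap_entropy_bits(sorted_values):
    """Estimate bits per gap, treating gaps as independent draws."""
    gaps = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    counts = Counter(gaps)
    total = len(gaps)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


print(gap_entropy_bits([1, 3, 4, 5, 9, 10]))      # about 1.37 bits per gap
print(gap_entropy_bits(list(range(0, 100, 5))))   # evenly spaced: no information per gap
```

If the estimate comes out far below 32 bits per value, a difference-based coder should do well. If it is close to log2 of the number of gaps, as it would be for a Fibonacci-like sequence where every gap is different, this model does not fit and another one is needed.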
Answer 4:
If you need fast random-access lookup, then a Huffman-encoding of the differences (as suggested by Niyaz) is only half the story. You will probably also need some sort of paging/indexing scheme so that it is easy to extract the nth number.
If you don't do this, then extracting the nth number is an O(n) operation, as on average you have to read and Huffman decode half the file before you find the number you are after. You have to choose the page size carefully to balance the overhead of storing page offsets against the speed of lookup.
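A paging scheme along these lines might look like the following Python sketch. To keep it short, it swaps Huffman coding for a simple variable-length byte encoding (varint) of the gaps; the paging logic would be the same with any gap coder. All names (`build`, `nth`, `page_size`) are assumptions made for this example.

```python
def encode_varint(n, out):
    """Append n to out using 7 data bits per byte, low bits first."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)


def decode_varint(buf, pos):
    """Read one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def build(sorted_values, page_size):
    """Encode values page by page and record where each page starts."""
    data = bytearray()
    page_offsets = []
    for i, v in enumerate(sorted_values):
        if i % page_size == 0:
            page_offsets.append(len(data))
            encode_varint(v, data)  # each page starts with an absolute value
        else:
            encode_varint(v - sorted_values[i - 1], data)
    return bytes(data), page_offsets


def nth(data, page_offsets, page_size, n):
    """Return the nth value by decoding at most one page."""
    pos = page_offsets[n // page_size]
    value, pos = decode_varint(data, pos)
    for _ in range(n % page_size):
        gap, pos = decode_varint(data, pos)
        value += gap
    return value


values = [1, 4, 7, 9, 19, 42, 53, 55, 78, 80, 1000, 70000]
data, offsets = build(values, page_size=4)
print(nth(data, offsets, 4, 10))  # 1000
```

With this layout a lookup decodes at most `page_size` entries instead of half the file. Smaller pages make lookups faster but add more offsets and more full-width page-start values.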
Answer 5:
The conditions on the lists of integers are slightly different, but the question "Compression for a unique stream of data" suggests several approaches which could help you.
I'd suggest prefiltering the data into a start and a series of offsets. If you know that the offsets will reliably be small, you could even encode them as 1- or 2-byte quantities instead of 4 bytes. If you don't know this, each offset could still be 4 bytes, but since they will be small diffs, you'll get many more repeats than you would storing the original integers.
After prefiltering, run your output through the compression scheme of your choice - something that works on a byte level, like gzip or zlib, would probably do a really nice job.
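The prefilter-then-compress pipeline could look roughly like this in Python, using the standard `struct` and `zlib` modules. The function names are hypothetical, the offsets are kept at 4 bytes each, and the sketch assumes a non-empty sorted list of unsigned 32-bit values.

```python
import struct
import zlib


def compress_sorted(values):
    """Store the start value and the offsets, then deflate the bytes."""
    gaps = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    raw = struct.pack(f"<{len(gaps)}I", *gaps)  # little-endian uint32 each
    return zlib.compress(raw, 9)


def decompress_sorted(blob):
    """Inflate the bytes and rebuild the values with a running sum."""
    raw = zlib.decompress(blob)
    gaps = struct.unpack(f"<{len(raw) // 4}I", raw)
    values, total = [], 0
    for g in gaps:
        total += g
        values.append(total)
    return values


values = list(range(1000, 200000, 7))
blob = compress_sorted(values)
print(len(values) * 4, "bytes raw ->", len(blob), "bytes compressed")
```

The example data is artificially regular (every offset is 7), so it compresses far better than a real index would; the point is only to show the order of the steps.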
Answer 6:
I would imagine Huffman coding would be quite appropriate for this purpose (and relatively quick compared to other algorithms with similar compression ratios).
EDIT: My answer was only a general pointer. Niyaz's suggestion of encoding the differences between consecutive numbers is a good one. (However, if the list is not ordered or the spacing of the numbers is very irregular, I think it would be no less effective to use plain Huffman encoding. In fact, LZW or similar would likely be best in this case, though possibly still not very good.)
Answer 7:
MSalters' answer is interesting but might distract you if you don't analyze it properly. There are only 47 Fibonacci numbers that fit in 32 bits.
But he is spot on about how to solve the problem properly: analyze the series of increments and look for patterns there to compress.
Things that matter: a) Are there repeated values? If so, how often? (If they are frequent, make them part of the compression scheme; if not, treat them as exceptions.) b) Does it look quasi-random? That can also be good, as a suitable average increment can likely be found.
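The count of 47 can be checked with a few lines of Python, along with the point that differencing a Fibonacci sequence just yields the sequence again. The helper name `fibonacci_below` is made up for this example, and the count starts at F(1) = 1, so the value 1 appears twice.

```python
def fibonacci_below(limit):
    """Return F(1), F(2), ... for every Fibonacci number below limit."""
    result = []
    a, b = 1, 1
    while a < limit:
        result.append(a)
        a, b = b, a + b
    return result


fibs = fibonacci_below(2**32)
gaps = [b - a for a, b in zip(fibs, fibs[1:])]

print(len(fibs))               # 47
print(fibs[-1])                # 2971215073, the largest one below 2**32
print(gaps[1:] == fibs[:-2])   # True: the gaps are the sequence itself, shifted
```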
Answer 8:
I'd use something bog standard off the shelf before investing in your own scheme.
In Java, for example, you can use GZIPOutputStream to apply gzip compression.
Answer 9:
Maybe you could store the differences between consecutive 32-bit integers as 16-bit integers.
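A minimal Python sketch of the 16-bit idea follows, assuming a non-empty sorted list. The names `pack_u16_gaps` and `unpack_u16_gaps` are invented here, and the sketch simply refuses input whose differences do not fit in 16 bits; a real format would need an escape mechanism for larger gaps.

```python
import struct


def pack_u16_gaps(values):
    """Store the first value as uint32 and every later gap as uint16."""
    gaps = [b - a for a, b in zip(values, values[1:])]
    if any(g > 0xFFFF for g in gaps):
        raise ValueError("a gap does not fit in 16 bits")
    return struct.pack(f"<I{len(gaps)}H", values[0], *gaps)


def unpack_u16_gaps(blob):
    """Rebuild the values from the first value and the 16-bit gaps."""
    n_gaps = (len(blob) - 4) // 2
    first, *gaps = struct.unpack(f"<I{n_gaps}H", blob)
    values = [first]
    for g in gaps:
        values.append(values[-1] + g)
    return values


values = [10, 11, 500, 66035]
blob = pack_u16_gaps(values)
print(len(blob), "bytes instead of", 4 * len(values))  # 10 bytes instead of 16
```

This roughly halves the file on its own, and lookups stay cheap because every gap has a fixed width.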
Source: https://stackoverflow.com/questions/523733/compress-sorted-integers