Compression of sequence of integers providing random access

不打扰是莪最后的温柔 提交于 2020-01-02 05:37:07

问题


I have a sequence of n integers in a small range [0,k) and all the integers have the same frequency f (so the size of the sequence is n=f∗k). What I'm trying to do now is to compress this sequence while providing random access (what is the i-th integer). The time to achieve random access doesn't have to be O(1). I'm more interested in achieving high compression at the expense of higher random access times.

I haven't tried with Huffman coding since it assigns codes based on frequencies (and all my frequencies are the same). Perhaps I'm missing some simple encoding for this particular case.

Any help or pointers would be appreciated.

Thanks in advance.

PS: Already asked in cs.stackexchange, but asking here also for better coverage, sorry.


回答1:


If all your integers have the same frequency, then a fair approximation to optimal compression will be ceil(log2(k)) bits per integer. You can access a bit-array of these in constant time.

If k is painfully small (like 3), the above method may waste a fair amount of space. But, you can combine a fixed number of your small integers into a base-k number, which can fit more efficiently into a fixed number of bits (you may also be able to fit the result conveniently into a standard-sized word). In any case, you can also access this coding in constant time.

If your integers don't have the same frequency, optimal compression may yield variable bit rates from different parts of your input, so the simple array access won't work. In that case, good random-access performance would require an index structure: break your compressed data into convenient sized chunks, which can each be decompressed sequentially, but this time is bounded by the chunk size.


If the frequency of each number is exactly the same, you may be able to save some space by taking advantage of this -- but it may not be enough to be worthwhile.

The entropy of n random numbers in range [0,k) is n log2(k), which is log2(k) bits per number; this is the number of bits it takes to encode your numbers without taking advantage of the exact frequency.

The entropy of distinguishable permutations of f copies each of k elements (where n=f*k) is:

log2( n!/(f!)^k ) = log2(n!) - k * log2(f!)

Applying Stirling's approximation (which is good here only if n and f are large), yields:

~ n log2(n) - n log2(e) - k ( f log2(f) - f log2(e) )
= n log2(n) - n log2(e) - n log2(f) + n log2(e)
= n ( log2(n) - log2(f) )
= n log2(n/f)
= n log2(k)

What this means is that, if n is large and k is small, you will not gain a significant amount of space by taking advantage of the exact frequency of your input.

The total error from the Stirling approximation above is O(log2(n) + k log2(f)), which is O(log2(n)/n + log2(f)/f) per number encoded. This does mean that if your k is so large that your f is small (i.e., each distinct number only has a small number of copies), you may be able to save some space with a clever encoding. However, the question specifies that k is, in fact, small.




回答2:


If you work out the number of possible different combinations and take its log base 2 you can find the best possible compression, and I don't think it will be that great in your case. With 16 numbers of frequency 1 the number of possible messages is 16! and Excel tells me log base 2 of 16! is 44.25, whereas storing them as 4-bit codes would only take 64 bits. (where there is more than one of each kind you want http://mathworld.wolfram.com/MultinomialCoefficient.html)

I think you will have a problem mixing random access into this because the only information you have is that there are fixed numbers of each type of element - in the whole sequence. That's not a lot of information for the whole sequences, and it says almost nothing about the first half of the sequence in isolation, because you could well have more of some number in the first half and less in the second half.



来源:https://stackoverflow.com/questions/12517580/compression-of-sequence-of-integers-providing-random-access

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!