What can be the least possible value of data-compression-ratio for any real dataset


Question


I am writing a ZLIB-like API for an embedded hardware compressor which uses the deflate algorithm to compress a given input stream.

Before going further I would like to explain the data compression ratio. The data compression ratio is defined as the ratio between the uncompressed size and the compressed size.

The compression ratio is usually greater than one, which means the compressed data is usually smaller than the uncompressed data; that is the whole point of compression. But this is not always the case. For example, using the ZLIB library on pseudo-random data generated on a Linux machine gives a compression ratio of roughly 0.996, meaning 9960 bytes of input were compressed into 10000 bytes of output.

I know ZLIB handles this situation by using a type 0 (stored) block, where it simply returns the original uncompressed data with a roughly 5-byte header, so it adds only about 5 bytes of overhead per data block of up to 64 KB. This is an elegant solution to the problem, but for some reason I cannot use it in my API. I have to provide extra safe space in advance to handle this situation.

Now, if I knew the least possible data compression ratio, it would be easy to calculate the extra space I have to provide. Otherwise, to be safe, I would have to provide more extra space than needed, which can be critical in an embedded system.

While calculating the data compression ratio, I am not concerned with headers, footers, extremely small datasets, or system-specific details, as I handle those separately. What I am particularly interested in is: does there exist any real dataset of at least 1 KB that yields a compression ratio of less than 0.99 with the deflate algorithm? In that case the calculation would be:
Compression ratio = uncompressed size / (compressed size using deflate, excluding header, footer, and system-specific overhead)
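
For illustration, here is a minimal sketch in C of how the extra space could be budgeted, assuming the worst case is a stream that degenerates entirely into stored (type 0) blocks of at most 65535 bytes, each costing about 5 bytes of block overhead (the bound and the function name are my own illustration, not part of zlib):

    #include <stddef.h>

    /* Conservative worst-case deflate output size for n input bytes,
     * assuming the stream falls back to stored (type 0) blocks only:
     * each block holds at most 65535 bytes and costs 5 bytes of
     * overhead (1 byte holding the 3 header bits plus padding,
     * 2 bytes LEN, 2 bytes NLEN). Stream header and trailer are
     * excluded, matching the ratio definition above. */
    static size_t deflate_stored_bound(size_t n)
    {
        size_t blocks = (n + 65534) / 65535;
        if (blocks == 0)
            blocks = 1;             /* even empty input needs one block */
        return n + 5 * blocks;
    }

Under that assumption, deflate_stored_bound(10000) is 10005, i.e. a worst-case ratio of about 0.9995 for a 10 KB input.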

Please provide feedback. Any help would be appreciated. It would be great if a reference to such a dataset could be provided.

EDIT:
@MSalters' comment indicates that the hardware compressor is not following the deflate specification properly, and this could be a bug in the microcode.


Answer 1:


The deflate algorithm takes a similar approach to the ZLIB behavior you describe. Each block starts with a 3-bit header, and the lower two bits are 00 when the following block is stored: length-prefixed but otherwise uncompressed.

This means the worst case is a one-byte input that blows up to 6 bytes (3 header bits, 5 padding bits, 32 bits of length, 8 bits of data), so the worst ratio is 1/6 ≈ 0.17.
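
As an illustration of that layout, here is a hand-built raw deflate stream for a single byte, following RFC 1951 (the byte values are a worked example, not output captured from any particular encoder):

    #include <stdio.h>

    int main(void)
    {
        /* Raw deflate stream for the single byte 'A' as one stored block:
         *   byte 0:    BFINAL=1, BTYPE=00, then 5 padding bits -> 0x01
         *   bytes 1-2: LEN  = 0x0001 (little-endian)
         *   bytes 3-4: NLEN = ~LEN = 0xFFFE (little-endian)
         *   byte 5:    the data byte itself
         * One input byte becomes six output bytes, ratio 1/6. */
        unsigned char stream[6] = { 0x01, 0x01, 0x00, 0xFE, 0xFF, 'A' };

        for (int i = 0; i < 6; i++)
            printf("%02X ", stream[i]);
        printf("\n");
        return 0;
    }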

This of course assumes an optimal encoder. A suboptimal encoder might transmit a Huffman table for that one byte.




Answer 2:


Because of the pigeonhole principle

http://en.wikipedia.org/wiki/Pigeonhole_principle

you will always have strings that get compressed and strings that get expanded:

http://matt.might.net/articles/why-infinite-or-guaranteed-file-compression-is-impossible/

Theoretically you can achieve the best compression with zero-entropy data (the compression ratio grows without bound) and the worst compression with maximum-entropy data (e.g. AWGN-like noise), where the ratio falls to about 1, or slightly below it once the format overhead is counted.
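
A small zlib-based demo makes the contrast concrete (assuming zlib is available; link with -lz, and the exact sizes printed will vary slightly by zlib version):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const uLong n = 10000;
        unsigned char *in = malloc(n);
        uLong bound = compressBound(n);
        unsigned char *out = malloc(bound);
        uLongf outlen;

        /* zero-entropy input: compresses extremely well */
        memset(in, 0, n);
        outlen = bound;
        compress2(out, &outlen, in, n, Z_BEST_COMPRESSION);
        printf("zeros : %lu -> %lu bytes\n", n, (unsigned long)outlen);

        /* pseudo-random input: expands slightly */
        for (uLong i = 0; i < n; i++)
            in[i] = (unsigned char)rand();
        outlen = bound;
        compress2(out, &outlen, in, n, Z_BEST_COMPRESSION);
        printf("random: %lu -> %lu bytes\n", n, (unsigned long)outlen);

        free(in);
        free(out);
        return 0;
    }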




Answer 3:


I can't tell from your question whether you're using zlib or not. If you're using zlib, it provides a function, deflateBound(), which does exactly what you're asking for, taking an uncompressed size and returning the maximum compressed size. It takes into account how the deflate stream was initialized with deflateInit() or deflateInit2() in computing the proper header and trailer sizes.
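
For example, a minimal use of deflateBound() might look like this (the input size is an arbitrary example; link with -lz):

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        z_stream strm = {0};        /* Z_NULL allocators -> zlib defaults */
        uLong source_len = 10000;   /* arbitrary example input size */

        if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
            return 1;

        /* Maximum possible compressed size for source_len input bytes,
         * given how this stream was initialized. */
        uLong bound = deflateBound(&strm, source_len);
        printf("deflateBound(%lu) = %lu\n", source_len, bound);

        deflateEnd(&strm);
        return 0;
    }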

If you're writing your own deflate, then you will already know what the maximum compressed size is based on how often you allow it to use stored blocks.

Update: The only way to know for sure the maximum data expansion of a hardware deflator is to obtain the algorithm used. Then through inspection you can determine how often it will emit stored blocks for random data.

The only alternative is empirical and unreliable. You can feed the hardware compressor random data, and examine the results. You can use infgen to disassemble the deflate output and see the stored blocks and their sizes. Then you can write a linear bounding formula for the expansion. Then add some margin to the additive and multiplicative terms to cover for situations that you did not observe in your tests.
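
The result of such a fit might look like the following sketch; the slope and the constant margin here are hypothetical placeholders that would have to be replaced with values measured from the hardware's actual output:

    #include <stddef.h>

    /* Hypothetical empirical bound for a hardware deflate engine:
     * output <= input + input * slope + constant, with the slope
     * and constant taken from observed worst cases plus a margin. */
    static size_t hw_deflate_bound(size_t n)
    {
        return n + n / 1000 + 64;   /* placeholder slope and margin */
    }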

This will only work if the hardware deflate algorithm is well behaved, which means that it will not write a fixed or dynamic deflate block if a stored block would be smaller. If it is not well behaved, then all bets are off.



Source: https://stackoverflow.com/questions/18978731/what-can-be-the-least-possible-value-of-data-compression-ratio-for-any-real-data
