Zlib deflated input is larger than original input string of chars?

Question

I'm a bit confused by how zlib compresses a string of chars. Below are the code and its output. What I noticed is that the input string is shorter, in bytes, than the compressed output.

The uncompressed size was 8 bytes and the compressed size is 12? Am I reading this correctly?

Here's the code.

#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <iostream>
#include "zlib.h"

void print( char *array, int length)
{
    for(int index = 0; index < length; index++)
        std::cout<<array[index];

    std::cout<<std::endl;
}
void clear( char *array, int length)
{
    for(int index = 0; index < length; index++)
        array[index] = 0;
}
int main()
{
    const int length = 30;
    char a[length] = "HHHHHHH";
    char b[length] = "";
    char c[length] = "";

    print( a, length);

    std::cout<<std::endl;
    uLong ucompSize = strlen(a)+1; // "string" + NULL delimiter.
    std::cout<<"ucompSize: "<<ucompSize<<std::endl;
    uLong compSize = compressBound(ucompSize);
    std::cout<<"compSize: "<<compSize<<std::endl;
    std::cout<<std::endl;
    // Deflate
    compress((Bytef *)b, &compSize, (Bytef *)a, ucompSize);
    std::cout<<"ucompSize: "<<ucompSize<<std::endl;
    std::cout<<"compSize: "<<compSize<<std::endl;
    print( b, length);
    std::cout<<std::endl;
    // Inflate
    uncompress((Bytef *)c, &ucompSize, (Bytef *)b, compSize);
    std::cout<<"ucompSize: "<<ucompSize<<std::endl;
    std::cout<<"compSize: "<<compSize<<std::endl;
    print( c, length);

    return 0;
}

And here's the output.

HHHHHHH

ucompSize: 8
compSize: 21

ucompSize: 8
compSize: 12
x��     ��

ucompSize: 8
compSize: 12
HHHHHHH

Process returned 0 (0x0)   execution time : 0.013 s
Press ENTER to continue.

Answer 1:

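If you want to guard against that, note that compressBound() gives the worst-case compressed size for a given input length. It is an upper bound, so it is always somewhat larger than the input (21 for your 8 bytes); use it to size the destination buffer, then compare the length that compress() actually reports against the original length to see whether compressing was worth it: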

ZEXTERN uLong ZEXPORT compressBound OF((uLong sourceLen));
compressBound() returns an upper bound on the compressed size after compress() or compress2() on sourceLen bytes. It would be used before a compress() or compress2() call to allocate the destination buffer.

Answer 2:

At least six of those bytes are overhead: the two-byte header at the front of the compressed stream, which identifies it as zlib data, and a four-byte checksum at the end. Discounting the overhead of the format leaves at most six bytes of compressed data, which is smaller than your eight-byte input.

Refer to §2.2 of RFC 1950 for more detail about the stream format. You can use tools like xxd or hexdump to inspect the bytes and confirm which parts of the output stream are overhead and which are compressed data.
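
If you would rather do that inspection in code than in a hex dump, here is a minimal sketch that splits a zlib stream into the pieces described above. ZlibParts and splitZlibStream are hypothetical names of mine, and the sketch assumes no preset dictionary is in use (compress() does not set one), so the header is exactly two bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view of a zlib stream, split along the RFC 1950 layout.
struct ZlibParts {
    bool headerOk;          // CM == 8 (deflate) and (CMF*256 + FLG) % 31 == 0
    std::size_t rawSize;    // bytes of raw deflate data between header and trailer
    std::uint32_t adler32;  // trailer: checksum of the uncompressed data, big-endian
};

// Assumes no preset dictionary (FDICT clear), so the header is 2 bytes.
ZlibParts splitZlibStream(const std::vector<unsigned char> &s)
{
    ZlibParts p{false, 0, 0};
    if (s.size() < 6)
        return p;  // too short for the 2-byte header plus 4-byte trailer

    const unsigned cmf = s[0], flg = s[1];
    p.headerOk = (cmf & 0x0F) == 8 && ((cmf << 8) | flg) % 31 == 0;
    p.rawSize = s.size() - 6;

    const std::size_t t = s.size() - 4;  // start of the trailer
    p.adler32 = (std::uint32_t(s[t]) << 24) | (std::uint32_t(s[t + 1]) << 16) |
                (std::uint32_t(s[t + 2]) << 8) | std::uint32_t(s[t + 3]);
    return p;
}
```

For the 12-byte output in the question, rawSize would come out as 6. Feeding it the 8 bytes 78 9c 03 00 00 00 00 01 (what zlib typically produces for empty input at the default level) gives a valid header, rawSize 2 and a checksum of 1.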

Answer 3:

The compress() function uses the zlib format, which puts a two-byte header and four-byte trailer around the raw compressed data. Even if the raw compressed data is smaller than the original string, you will get six more bytes from the wrapper. For an empty string, no bytes at all, the raw compressed data is two bytes. So the minimum size of a zlib stream is eight bytes. Eight repeated input bytes can result in raw compressed data as short as four bytes, so the minimum zlib-wrapped result is ten bytes.
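
To make the four-byte trailer less mysterious: in the zlib format it holds an Adler-32 checksum of the uncompressed data (RFC 1950). The sketch below is a plain, unoptimized version of that checksum; adler32Of is a name I made up to avoid clashing with zlib's own adler32() function, which is what you would normally call.

```cpp
#include <cstddef>
#include <cstdint>

// Straightforward Adler-32 per RFC 1950: two running sums modulo 65521.
// Hypothetical helper for illustration; zlib ships a faster adler32().
std::uint32_t adler32Of(const unsigned char *data, std::size_t len)
{
    const std::uint32_t MOD = 65521;  // largest prime below 2^16
    std::uint32_t a = 1, b = 0;
    for (std::size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % MOD;  // sum of all bytes, plus one
        b = (b + a) % MOD;        // sum of the intermediate values of a
    }
    return (b << 16) | a;
}
```

For empty input this returns 1, and for the eight bytes "HHHHHHH" plus the terminating NUL it returns 0x09E001F9. Those four bytes, in big-endian order, are what the wrapper appends after the raw compressed data.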

In general you need much larger inputs for lossless compression to be effective.

Source: https://stackoverflow.com/questions/43985477/zlib-deflated-input-is-larger-than-original-input-string-of-chars

Tags

c++

compression

zlib