How to efficiently predict if data is compressible


In my experience, almost all formats that compress effectively are non-binary. So checking whether about 70-80% of the bytes are within the [0-127] range should do the trick.
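A minimal sketch of that check (the 75% cutoff is just an illustrative pick from the 70-80% range):

static boolean looksLikeText(byte[] data, int len) {
    int ascii = 0;
    for (int i = 0; i < len; i++) {
        // bytes are signed in Java, so values 0-127 are non-negative
        if (data[i] >= 0) {
            ascii++;
        }
    }
    // text-like if at least ~75% of the bytes are in the [0-127] range
    return ascii * 4 >= len * 3;
}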

If you want to do it "properly" (even though I can't really see a reason to), you either have to run (parts of) your compression algorithm on the data or calculate its entropy, as tskuzzy already proposed.

I implemented a few methods to test if data is compressible.

Simplified Compression

This basically checks for duplicate byte pairs:

static boolean isCompressible(byte[] data, int len) {
    int result = 0;
    // check the data in blocks (255 bytes each, stride 256),
    // and sum up how compressible each block is
    for (int start = 0; start < len; start += 256) {
        result += matches(data, start, Math.min(start + 255, len));
    }
    // the result is proportional to the number of 
    // bytes that can be saved
    // if we can save many bytes, then it is compressible
    // long arithmetic avoids int overflow for large len
    return (len - result) * 777L < len * 100L;
}

static int matches(byte[] data, int i, int end) {
    // bitArray is a bloom filter of seen byte pairs
    // match counts duplicate byte pairs
    // last is the last seen byte
    int bitArray = 0, match = 0, last = 0;
    if (i < 0 || end > data.length) {
        // this check may allow the JVM to avoid
        // array bound checks in the following loop
        throw new ArrayIndexOutOfBoundsException();
    }
    for (; i < end; i++) {
        int x = data[i];
        // the bloom filter bit to set
        int bit = 1 << ((last ^ x) & 31);
        // if it was already set, increment match
        // (without using a branch, as branches are slow)
        match -= (-(bitArray & bit)) >> 31;
        bitArray |= bit;
        last = x;
    }
    return match;
}

On my (limited) set of test data, this algorithm is quite accurate. It is about 5 times faster than actually compressing when the data is not compressible. For trivial data (all zeroes), however, it is only about half as fast.
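For reference, a tiny driver (not part of the original answer) that exercises the check on two extremes:

public static void main(String[] args) {
    int n = 64 * 1024;
    byte[] zeros = new byte[n];                // trivially compressible
    byte[] noise = new byte[n];
    new java.util.Random(42).nextBytes(noise); // essentially incompressible
    System.out.println(isCompressible(zeros, n)); // expected: true
    System.out.println(isCompressible(noise, n)); // expected: false
}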

Partial Entropy

This algorithm estimates the entropy of the high nibbles. I wanted to avoid using too many buckets, because they have to be zeroed out each time (which is slow if the blocks to check are small). 63 - numberOfLeadingZeros(v) is the integer base-2 logarithm of v (I wanted to avoid floating-point numbers). Depending on the data, this is faster or slower than the algorithm above (I'm not sure why). The result isn't quite as accurate as the algorithm above, possibly because only 16 buckets and only integer arithmetic are used.

static boolean isCompressible(byte[] data, int len) {
    // the number of bytes with 
    // high nibble 0, 1,.., 15
    int[] sum = new int[16];
    for (int i = 0; i < len; i++) {
        int x = (data[i] & 255) >> 4;
        sum[x]++;
    }
    // Shannon entropy, computed with integer arithmetic;
    // see the Wikipedia article on entropy for the formula :-)
    int r = 0;
    for (int x : sum) {
        long v = ((long) x << 32) / len;
        r += 63 - Long.numberOfLeadingZeros(v + 1);
    }
    // 438 is a tuned threshold; uniformly distributed
    // nibbles would score 16 * 28 = 448
    return r < 438;
}

Calculate the entropy of the data. If it has high entropy (~1.0 when normalized), it is not likely to compress much further. If it has low entropy (~0.0), there isn't a lot of "information" in it, and it can be compressed further.

Entropy provides a theoretical measure of how much a piece of data can be compressed.
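A sketch of that order-0 (per-byte) entropy estimate, normalized to [0, 1] so the ~1.0 and ~0.0 readings above apply directly:

static double normalizedEntropy(byte[] data, int len) {
    // count the occurrences of each byte value
    int[] counts = new int[256];
    for (int i = 0; i < len; i++) {
        counts[data[i] & 0xFF]++;
    }
    // Shannon entropy: -sum(p * log2(p)) over all byte values
    double entropy = 0;
    for (int c : counts) {
        if (c > 0) {
            double p = (double) c / len;
            entropy -= p * Math.log(p) / Math.log(2);
        }
    }
    // 8 bits per byte is the maximum, so the result is in [0, 1]
    return entropy / 8;
}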

This problem is interesting in its own right because, with zlib for example, compressing incompressible data takes much longer than compressing compressible data, so an unsuccessful compression attempt is especially expensive (for details, see the links). Nice recent work in this area has been done by Harnik et al. from IBM.

Yes, the prefix method and byte-wise order-0 entropy (simply called entropy in the other posts) are good indicators. Other good ways to guess whether a file is compressible (from the paper):

  • Core-set size – the character set that makes up most of the data (see the sketch after the paper links below)
  • Symbol-pairs distribution indicator

Here is the FAST paper and the slides.
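One possible reading of the core-set heuristic (a sketch only; the 90% coverage and the 50-symbol cutoff are illustrative values, not taken from the paper):

static boolean hasSmallCoreSet(byte[] data, int len) {
    // count the occurrences of each byte value
    int[] counts = new int[256];
    for (int i = 0; i < len; i++) {
        counts[data[i] & 0xFF]++;
    }
    java.util.Arrays.sort(counts); // ascending: most frequent values last
    // how many distinct symbols are needed to cover 90% of the data?
    int covered = 0, coreSetSize = 0;
    for (int i = 255; i >= 0 && covered * 10 < len * 9; i--) {
        covered += counts[i];
        coreSetSize++;
    }
    // a small core set suggests the data is compressible
    return coreSetSize <= 50;
}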

I expect there's no way to check how compressible something is until you try to compress it. You could check for patterns (more patterns, perhaps more compressible), but a particular compression algorithm may not use the patterns you checked for, and may do better than you expect. Another trick is to take the first 128,000 bytes of the data, push them through Deflate/Java compression, and see if the output is smaller than the input. If so, chances are it's worthwhile to compress the entire lot.
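A sketch of that trick using java.util.zip.Deflater (the fastest compression level is an arbitrary choice here):

import java.util.zip.Deflater;

static boolean sampleCompressesWell(byte[] data) {
    int sampleLen = Math.min(data.length, 128000);
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(data, 0, sampleLen);
    deflater.finish();
    // if the output does not fit into sampleLen bytes,
    // the sample expanded and is not worth compressing
    byte[] out = new byte[sampleLen];
    int compressedLen = 0;
    while (!deflater.finished() && compressedLen < out.length) {
        compressedLen += deflater.deflate(out, compressedLen, out.length - compressedLen);
    }
    boolean smaller = deflater.finished() && compressedLen < sampleLen;
    deflater.end();
    return smaller;
}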

Fast compressors such as LZ4 already have built-in checks for data compressibility: they quickly skip the bad segments to concentrate on the more interesting ones. To give a concrete example, LZ4 on non-compressible data works at almost the RAM speed limit (2 GB/s on my laptop), so there is little room for a detector to be even faster. You can try it for yourself: http://code.google.com/p/lz4/
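If you want to measure this from Java, here is a sketch using the lz4-java binding (net.jpountz.lz4); the ~90% ratio cutoff is an arbitrary illustration, not something LZ4 itself defines:

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

static boolean shrinksUnderLz4(byte[] data) {
    LZ4Compressor compressor = LZ4Factory.fastestInstance().fastCompressor();
    byte[] out = new byte[compressor.maxCompressedLength(data.length)];
    int compressedLen = compressor.compress(data, 0, data.length, out, 0, out.length);
    // compressible if the output is below ~90% of the original size
    return compressedLen * 10L < data.length * 9L;
}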

It says on your profile that you're the author of the H2 Database Engine, a database written in Java.

If I am guessing correctly, you are looking to engineer this database engine to automatically compress BLOB data, if possible.

But -- (I am guessing) you have realized that not everything will compress, and speed is important -- so you don't want to waste a microsecond more than is necessary when determining if you should compress data...

My question is engineering in nature -- why do all this? Basically, isn't it second-guessing the intent of the database user / application developer -- at the expense of speed?

Wouldn't you think that an application developer (who is writing the data to the blob fields in the first place) would be the best person to decide whether data should be compressed, and if so -- to choose the appropriate compression method?

The only place I can see automatic database compression possibly adding value is in text/varchar fields -- and only if they're beyond a certain length -- but even so, that option might be better decided by the application developer... I might even go so far as to offer the application developer a compression plug-in... That way they can make their own decisions for their own data...

If my assumptions about what you are trying to do were wrong -- then I humbly apologize for saying what I said... (It's just one insignificant user's opinion.)

Also -- Why not try lzop? I can personally vouch for the fact that it's faster, much faster (compression and decompression) than bzip, gzip, zip, rar...

http://www.lzop.org

Using it for disk image compression makes the process disk-I/O bound, whereas any of the other compressors make the process CPU-bound (i.e., they use all available CPU, while lzop, on a reasonable CPU, can handle data at the same speed a stock 7200 RPM hard drive can dish it out...)

I'll bet if you tested it with the first X bytes of a 'test compression' string, it would be much faster than most other methods...
