Question:
Correct me if I'm approaching this wrong, but I have a queue server and a bunch of Java workers that I'm running in a cluster. My queue has work units that are very small, but there are many of them. So far my benchmarks and review of the workers have shown that I get about 200 MB/second.
So I'm trying to figure out how to push more work units through my available bandwidth. Currently my CPU usage is not very high (40-50%) because the workers can process the data faster than the network can deliver it. I want to get more work through the queue and am willing to pay for it with expensive compression/decompression (since half of each core is idle right now).
I have tried Java LZO and gzip, but was wondering if there is anything better (even if it's more CPU expensive)?
Update: the data is a byte[]. Basically the queue only accepts that format, so I am using a ByteArrayOutputStream to write two ints and an int[] out as a byte[]. The values in the int[] are all between 0 and 100 (or 1000, but the vast majority of the numbers are zeros). The lists are quite large, anywhere from 1,000 to 10,000 items (and, again, the majority are zeros; there are never more than 100 non-zero numbers in the int[]).
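Roughly, the packing looks like this (a simplified sketch; the field names are placeholders):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

static byte[] pack(int headerA, int headerB, int[] values) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(baos);
    out.writeInt(headerA);     // the two leading ints
    out.writeInt(headerB);
    for (int v : values) {     // 1,000-10,000 values, mostly zeros
        out.writeInt(v);
    }
    out.flush();
    return baos.toByteArray();
}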
Answer 1:
It sounds like using a custom compression mechanism that exploits the structure of the data could be very efficient.
Firstly, using a short[] (a 16-bit data type) instead of an int[] will halve (!) the amount of data sent. You can do this because the numbers easily fit between -2^15 (-32768) and 2^15-1 (32767). This is ridiculously easy to implement.
Secondly, you could use a scheme similar to run-length encoding: a positive number represents that number literally, while a negative number represents that many zeros (after taking the absolute value). e.g.
[10, 40, 0, 0, 0, 30, 0, 100, 0, 0, 0, 0] <=> [10, 40, -3, 30, -1, 100, -4]
This is harder to implement than just substituting short for int, but will provide ~80% compression in the very worst case (1000 numbers, 100 non-zero, none of which are consecutive).
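A minimal sketch of that scheme in Java (not from the original answer; the method names are illustrative, and it assumes the values themselves are non-negative, as they are here):

import java.util.ArrayList;
import java.util.List;

public class NegativeRunRle {
    // Encode: non-zero values are stored literally; each run of zeros
    // becomes a single negative short whose absolute value is the run length.
    public static short[] encode(short[] data) {
        List<Short> out = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            if (data[i] == 0) {
                int run = 0;
                while (i < data.length && data[i] == 0 && run < Short.MAX_VALUE) {
                    run++;
                    i++;
                }
                out.add((short) -run); // -n means "n zeros"
            } else {
                out.add(data[i++]);    // literal value
            }
        }
        short[] result = new short[out.size()];
        for (int j = 0; j < result.length; j++) result[j] = out.get(j);
        return result;
    }

    // Decode: negative entries expand back into zeros (the new array is
    // already zero-initialized, so we just skip ahead).
    public static short[] decode(short[] encoded) {
        int len = 0;
        for (short s : encoded) len += (s < 0) ? -s : 1;
        short[] out = new short[len];
        int pos = 0;
        for (short s : encoded) {
            if (s < 0) pos += -s;
            else out[pos++] = s;
        }
        return out;
    }
}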
I just did some simulations to work out the compression ratios. I tested the method I described above, and the one suggested by Louis Wasserman and sbridges. Both performed very well.
Assuming the length of the array and the number of non-zero numbers are both uniformly distributed between their bounds, both methods save about 5400 ints (or shorts) on average, with a compressed size of about 2.5% of the original! The run-length encoding method seems to save about 1 additional int (an average compressed size that is 0.03% smaller), i.e. basically no difference, so you should use whichever is easiest to implement. The following are histograms of the compression ratios for 50000 random samples (they are very similar!).

[Histogram images not reproduced here.]
Summary: using shorts instead of ints plus one of the compression methods, you will be able to compress the data to about 1% of its original size!
For the simulation, I used the following R script:
SIZE <- 50000
lengths <- sample(1000:10000, SIZE, replace=T)
nonzeros <- sample(1:100, SIZE, replace=T)

f.rle <- function(len, nonzero) {
  indexes <- sort(c(0, sample(1:len, nonzero, F)))
  steps <- diff(indexes)
  sum(steps > 1) + nonzero # one short per run of zeros, and one per non-zero value
}

f.index <- function(len, nonzero) {
  nonzero * 2 # one short for the index, one for the value
}

# using the [value, -1 * number of zeros, ...] method
rle.comprs <- mapply(f.rle, lengths, nonzeros)
print(mean(lengths - rle.comprs)) # average number of shorts saved
rle.ratios <- rle.comprs / lengths * 100
print(mean(rle.ratios)) # average compression ratio

# using the [(index, value), ...] method
index.comprs <- mapply(f.index, lengths, nonzeros)
print(mean(lengths - index.comprs)) # average number of shorts saved
index.ratios <- index.comprs / lengths * 100
print(mean(index.ratios)) # average compression ratio

par(mfrow=c(2,1))
hist(rle.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Run length encoding")
hist(index.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Store indices")
Answer 2:
Try encoding your data as pairs of varints: the first varint is the index of the number in the sequence, the second is the number itself. For entries which are 0, write nothing.
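A sketch of what that could look like (hand-rolled varints; libraries such as protobuf provide the same encoding, and the class and method names here are just illustrative):

import java.io.ByteArrayOutputStream;

public class SparseVarint {
    // Write v as a base-128 varint: 7 bits per byte,
    // high bit set means "more bytes follow".
    static void writeVarint(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Emit only the non-zero entries as (index, value) varint pairs.
    public static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0) {
                writeVarint(out, i);         // position of the non-zero entry
                writeVarint(out, values[i]); // the value itself
            }
        }
        return out.toByteArray();
    }
}

Note that the receiver also needs the original array length (e.g. sent as a separate header) to reconstruct any trailing zeros.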
Answer 3:
I wrote an implementation of an RLE algorithm. This operates on a byte array, so it could be used as an in-line filter with your existing code. It should safely handle large or negative values should your data change in the future.

It encodes a sequence of zeros as {0}{qty}, where {qty} is in the range 1..255. All other bytes are stored as the byte itself. You squish your byte array before sending, and bloat it back to full size when receiving.
import java.nio.ByteBuffer;

public static byte[] squish(byte[] bloated) {
    int size = bloated.length;
    // Worst case each byte becomes two ({0}{1} runs), plus the 4-byte length header.
    ByteBuffer bb = ByteBuffer.allocate(2 * size + 4);
    bb.putInt(size);
    int zeros = 0;
    for (int i = 0; i < size; i++) {
        if (bloated[i] == 0) {
            if (++zeros == 255) {
                bb.putShort((short) zeros); // big-endian short writes {0}{qty}
                zeros = 0;
            }
        } else {
            if (zeros > 0) {
                bb.putShort((short) zeros);
                zeros = 0;
            }
            bb.put(bloated[i]);
        }
    }
    if (zeros > 0) {
        bb.putShort((short) zeros);
    }
    size = bb.position();
    byte[] buf = new byte[size];
    bb.rewind();
    bb.get(buf, 0, size);
    return buf;
}

public static byte[] bloat(byte[] squished) {
    ByteBuffer bb = ByteBuffer.wrap(squished);
    byte[] bloated = new byte[bb.getInt()];
    int pos = 0;
    while (bb.hasRemaining()) {
        byte value = bb.get();
        if (value == 0) {
            bb.position(bb.position() - 1);
            pos += bb.getShort(); // re-read {0}{qty} as a short to get the run length
        } else {
            bloated[pos++] = value;
        }
    }
    return bloated;
}
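Round-trip usage would look something like this (Rle is a placeholder name for wherever the two methods above live):

byte[] original = ...;                  // your existing two-ints-plus-int[] payload
byte[] wire = Rle.squish(original);     // shrink before enqueueing
// ... send wire through the queue ...
byte[] restored = Rle.bloat(wire);      // expand on the worker
assert java.util.Arrays.equals(original, restored);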
Answer 4:
I've been impressed with BZIP2 compared with 7z and gzip. I haven't personally tried this Java implementation, but it looks like it would be easy to swap your GZIP call for this one and verify the results.
http://www.kohsuke.org/bzip2
Answer 5:
You should probably try all the major ones on your data stream and see which works best. You should also consider that some algorithms will take longer to run, adding more latency to the queue. This may or may not be a problem depending on your application.
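As a starting point, the JDK's built-in Deflater makes it easy to compare compressed size (and, with a timer, speed) at different levels on your real payloads; this sketch only covers zlib/deflate, so LZO, bzip2, etc. would each need their own library:

import java.util.zip.Deflater;

// Compress buf at the given level (0-9) and return the compressed size in bytes.
static int deflatedSize(byte[] buf, int level) {
    Deflater deflater = new Deflater(level);
    deflater.setInput(buf);
    deflater.finish();
    byte[] scratch = new byte[8192]; // reused each pass; only the size is tracked
    int total = 0;
    while (!deflater.finished()) {
        total += deflater.deflate(scratch);
    }
    deflater.end();
    return total;
}

// e.g. compare deflatedSize(payload, Deflater.BEST_SPEED)
//      against deflatedSize(payload, Deflater.BEST_COMPRESSION)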
You can sometimes get better compression if you know something about the data. (dbaupp's answer covers this approach nicely)
This comparison of compression algorithms might be useful.
Source: https://stackoverflow.com/questions/10369660/suggestions-for-compression-library-to-get-byte-as-small-as-possible-without-c