String, byte[] and compression

Question

We can convert a String to a byte[] and back easily:

        String s = "my string";
        byte[] b = s.getBytes();
        System.out.println(new String(b)); // my string

When compression is involved, however, there seem to be some issues. Suppose you have two methods, compress and uncompress (the code below works fine):

public static byte[] compress(String data) 
             throws UnsupportedEncodingException, IOException {
    byte[] input = data.getBytes("UTF-8");
    Deflater df = new Deflater();
    df.setLevel(Deflater.BEST_COMPRESSION);
    df.setInput(input);

    ByteArrayOutputStream baos = new ByteArrayOutputStream(input.length);
    df.finish();
    byte[] buff = new byte[1024];
    while (!df.finished()) {
        int count = df.deflate(buff);
        baos.write(buff, 0, count);
    }
    baos.close();
    byte[] output = baos.toByteArray();

    return output;
}

public static String uncompress(byte[] input) 
            throws UnsupportedEncodingException, IOException,
        DataFormatException {
    Inflater ifl = new Inflater();
    ifl.setInput(input);

    ByteArrayOutputStream baos = new ByteArrayOutputStream(input.length);
    byte[] buff = new byte[1024];
    while (!ifl.finished()) {
        int count = ifl.inflate(buff);
        baos.write(buff, 0, count);
    }
    baos.close();
    byte[] output = baos.toByteArray();

    return new String(output, "UTF-8"); // decode with the same charset used in compress
}

My tests look like this (they pass):

String text = "some text";
byte[] bytes = Compressor.compress(text);
assertEquals(Compressor.uncompress(bytes), text); // works

For no reason other than "why not", I'd like to modify the first method to return a String instead of a byte[].

So I return new String(output) from the compress method and modify my tests to:

String text = "some text";
String compressedText = Compressor.compress(text);
assertEquals(Compressor.uncompress(compressedText.getBytes()), text); // fails

This test fails with: java.util.zip.DataFormatException: incorrect header check

Why is that? What needs to be done to make it work?

Answer 1:

The String(byte[]) constructor is the problem. You cannot take arbitrary bytes, convert them to a String, and then convert them back to a byte array. The String class decodes the bytes according to the given (or default) charset. If a byte sequence is not valid in that charset, it is typically replaced by a substitute character rather than preserved. The conversion from bytes to String and back to bytes is lossless only if the bytes really represent some String in that encoding.

Here is the simplest example:

new String(new byte[]{-128}, "UTF-8").getBytes("UTF-8")

The above returns the bytes -17, -65, -67 (the UTF-8 encoding of the replacement character U+FFFD), whereas an input of 127 comes back unchanged.
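To tie this back to the compressed data in the question, here is a minimal sketch that deflates a string and pushes the result through a String and back. The class and method names are hypothetical, and UTF-8 is used explicitly as an assumption; the question relied on the platform default charset, which may behave differently.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

public class LossyRoundTripDemo {

    // Compress text with Deflater, mirroring the question's compress method.
    static byte[] deflate(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater df = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            df.setInput(input);
            df.finish();
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] buff = new byte[1024];
            while (!df.finished()) {
                int count = df.deflate(buff);
                baos.write(buff, 0, count);
            }
            return baos.toByteArray();
        } finally {
            df.end(); // release native zlib resources
        }
    }

    // Decode arbitrary bytes as UTF-8 text, then encode that text again.
    // Assumption: UTF-8 stands in for whatever charset is actually in use.
    static byte[] throughString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] compressed = deflate("some text");
        byte[] mangled = throughString(compressed);
        System.out.println("compressed: " + Arrays.toString(compressed));
        System.out.println("mangled:    " + Arrays.toString(mangled));
        System.out.println("identical:  " + Arrays.equals(compressed, mangled));
    }
}
```

Comparing the two printed arrays shows where bytes were substituted. Once the bytes at the start of the stream have changed, the Inflater no longer sees a valid header, which is consistent with the "incorrect header check" error.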

Answer 2:

It fails because you convert the bytes to a String using your platform's default encoding. Most bytes will be converted to their equivalent character codes, but some may be replaced by other codes, depending on that encoding. To see what happens to your bytes, just run:

byte[] b = new byte[256];
for(int i = 0; i < b.length; ++i) {
    b[i] = (byte)i;
}
String s = new String(b);

for(int i = 0; i< s.length(); ++i) {
    System.out.println(i + ": " + s.substring(i, i+1) + " " + (int)s.charAt(i));
}

As you can see, if you convert that back to bytes, several codes collapse to the same value. Note that this sample does not handle encodings in which a character is encoded with more than one byte, as in UTF-8.
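The same experiment can be turned into a count. The following is a rough sketch (the class and method names are made up for illustration) that checks, for a given charset, how many of the 256 possible single-byte values come back unchanged after a bytes -> String -> bytes round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteSurvivalCheck {

    // Count the single-byte values that survive decoding and re-encoding
    // with the given charset. Each value is tested on its own.
    static int survivors(Charset cs) {
        int count = 0;
        for (int i = 0; i < 256; i++) {
            byte[] in = { (byte) i };
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length == 1 && out[0] == in[0]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            Charset.defaultCharset(),
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1
        };
        for (Charset cs : charsets) {
            System.out.println(cs + ": " + survivors(cs) + " of 256 byte values survive");
        }
    }
}
```

Running it prints one count per charset, which makes it easy to see how much a particular encoding loses on arbitrary binary data.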

In general, one should avoid calling String.getBytes() and new String(byte[]) without supplying an appropriate encoding parameter. Among the standard charsets, only a single-byte one such as ISO-8859-1 maps every byte to the character with the same code, and relying on that to carry binary data in a String is fragile.

If you really want to handle your compressed data as a String, use a base64 representation or a hex dump. But beware: a String can need up to twice as much memory as the corresponding bytes, and on top of that base64 adds a factor of 4/3 and hex a factor of 2. This might eat up the benefit of compression.
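A minimal sketch of the base64 approach, assuming Java 8 or later for java.util.Base64. The class name Base64Compressor and both method names are hypothetical and not part of the question's Compressor class.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class Base64Compressor {

    // Deflate the UTF-8 bytes of the text, then represent them as base64 text.
    public static String compressToBase64(String data) {
        byte[] input = data.getBytes(StandardCharsets.UTF_8);
        Deflater df = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            df.setInput(input);
            df.finish();
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] buff = new byte[1024];
            while (!df.finished()) {
                int count = df.deflate(buff);
                baos.write(buff, 0, count);
            }
            return Base64.getEncoder().encodeToString(baos.toByteArray());
        } finally {
            df.end(); // release native zlib resources
        }
    }

    // Decode the base64 text back to the compressed bytes, then inflate them.
    public static String uncompressFromBase64(String base64) {
        byte[] input = Base64.getDecoder().decode(base64);
        Inflater ifl = new Inflater();
        try {
            ifl.setInput(input);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] buff = new byte[1024];
            while (!ifl.finished()) {
                int count = ifl.inflate(buff);
                if (count == 0 && (ifl.needsInput() || ifl.needsDictionary())) {
                    // Avoid looping forever on truncated or unsupported input.
                    throw new IllegalArgumentException("Incomplete compressed data");
                }
                baos.write(buff, 0, count);
            }
            return new String(baos.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid compressed data", e);
        } finally {
            ifl.end();
        }
    }

    public static void main(String[] args) {
        String text = "some text";
        String encoded = compressToBase64(text);
        System.out.println(encoded);
        System.out.println(uncompressFromBase64(encoded).equals(text));
    }
}
```

The encoded String contains only ASCII characters, so it survives being stored or transmitted as text. The cost is the 4/3 size overhead mentioned above.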

Source: https://stackoverflow.com/questions/11762975/string-byte-and-compression

Tags

java

compression