ISO-8859-1 encoding and binary data preservation

不知归路 2020-11-29 10:35

I read in a comment to an answer by @Esailija to a question of mine that

ISO-8859-1 is the only encoding to fully retain the original binary data, wi

2 answers
  •  借酒劲吻你
    2020-11-29 11:15

    "\u00F6" is not a byte array. It's a string containing a single char. Execute the following test instead:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class SingleRoundTrip {
        public static void main(String[] args) {
            byte[] b = new byte[] {(byte) 0x00, (byte) 0xf6};
            String s = new String(b, StandardCharsets.ISO_8859_1); // decoding
            byte[] b2 = s.getBytes(StandardCharsets.ISO_8859_1); // encoding
            System.out.println("Are the bytes equal : " + Arrays.equals(b, b2)); // true
        }
    }
    

    To check that this holds for every byte value, extend the code to loop through all 256 of them:

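    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class AllBytesRoundTrip {
        public static void main(String[] args) {
            // Fill the array with every possible byte value, 0x00 through 0xFF.
            byte[] b = new byte[256];
            for (int i = 0; i < b.length; i++) {
                b[i] = (byte) i;
            }
            String s = new String(b, StandardCharsets.ISO_8859_1); // decoding
            byte[] b2 = s.getBytes(StandardCharsets.ISO_8859_1); // encoding
            System.out.println("Are the bytes equal : " + Arrays.equals(b, b2)); // true
        }
    }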
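
    For contrast, the same round trip can be run through UTF-8. This is only a sketch: the class name `RoundTripComparison` and the helpers `roundTrip` and `allByteValues` are made up for illustration. It compares a single other charset and does not show that no other charset round-trips.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripComparison {
    // Decode the bytes to a String with the given charset, then encode back.
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    // Build an array holding every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        // ISO-8859-1 maps each byte to exactly one char, so nothing is lost.
        System.out.println("ISO-8859-1 preserved: "
                + Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1))); // true
        // Bytes 0x80..0xFF on their own are malformed UTF-8; the String
        // constructor replaces them with U+FFFD, so the original bytes are lost.
        System.out.println("UTF-8 preserved: "
                + Arrays.equals(all, roundTrip(all, StandardCharsets.UTF_8))); // false
    }
}
```

    The UTF-8 case fails because `new String(byte[], Charset)` silently substitutes a replacement character for malformed input instead of throwing.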
    

    ISO-8859-1 is a standard encoding, so the language used (Java, C#, or anything else) doesn't matter.

    Here's a Wikipedia reference that claims that every byte is covered:

    In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.

    (emphasis mine)
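
    The mapping described in the quote can also be checked directly. The sketch below assumes Java's built-in ISO-8859-1 charset; the class and method names are hypothetical. It verifies that each of the 256 byte values decodes to exactly one char whose numeric value equals the unsigned byte value (U+0000 through U+00FF).

```java
import java.nio.charset.StandardCharsets;

class Latin1CodePoints {
    // Returns true if byte value i decodes to the char with numeric value i,
    // for every i in 0..255.
    static boolean decodesToSameCodePoints() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String s = new String(all, StandardCharsets.ISO_8859_1);
        if (s.length() != all.length) {
            return false; // each byte must yield exactly one char
        }
        for (int i = 0; i < all.length; i++) {
            if (s.charAt(i) != i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Each byte maps to the same code point: "
                + decodesToSameCodePoints()); // true
    }
}
```

    Because the mapping is one-to-one over all 256 values, including the C0 and C1 control ranges, decoding followed by encoding returns the original bytes.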
