ISO-8859-1 encoding and binary data preservation

不知归路 2020-11-29 10:35

I read in a comment to an answer by @Esailija to a question of mine that

ISO-8859-1 is the only encoding to fully retain the original binary data, wi

2 Answers
  •  醉酒成梦
    2020-11-29 11:14

    For an encoding to retain the original binary data, it needs to map every distinct byte sequence to a distinct character sequence.

    This rules out all multi-byte encodings (UTF-8/16/32, Shift-JIS, Big5, etc.) because not every byte sequence is valid in them, so an invalid sequence decodes to some replacement character (usually ? or �). Once decoding is done, there is no way to tell from the string which bytes caused the replacement character.
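
    As a rough illustration of that loss, the sketch below decodes two different invalid bytes as UTF-8 in Java. The class name Utf8LossDemo and the helper roundTrip are made up for this example; it relies on the documented behavior that new String(byte[], Charset) substitutes the charset's replacement string for malformed input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8LossDemo {
    // Hypothetical helper: decode bytes as UTF-8, then encode the string back.
    static byte[] roundTrip(byte[] original) {
        String decoded = new String(original, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0xFF}; // 0xFF never appears in valid UTF-8
        byte[] b = {(byte) 0xFE}; // neither does 0xFE

        String sa = new String(a, StandardCharsets.UTF_8);
        String sb = new String(b, StandardCharsets.UTF_8);

        // Both inputs collapse to the same replacement character U+FFFD.
        System.out.println(sa.equals(sb));                  // true
        // Encoding U+FFFD yields EF BF BD, not the original byte.
        System.out.println(Arrays.equals(a, roundTrip(a))); // false
    }
}
```

    Since both inputs produce the same string, nothing in that string can say which byte was originally there.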

    Another option is to ignore the invalid bytes, but then infinitely many different byte sequences decode to the same string. You could instead replace invalid bytes with their hex encoding in the string, like "0xFF", but there is no way to tell whether the original bytes legitimately decoded to "0xFF", so that doesn't work either.
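
    The hex-escape ambiguity can be sketched with a toy decoder. Everything here is an assumption made for illustration: decodeWithHexEscapes is a hypothetical helper, not a library API, and it treats only ASCII bytes as valid to keep the example short.

```java
import java.nio.charset.StandardCharsets;

class HexEscapeDemo {
    // Hypothetical lenient decoder: ASCII bytes become characters,
    // every other byte is spelled out as the text "0xNN".
    static String decodeWithHexEscapes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int value = b & 0xFF; // treat the byte as unsigned
            if (value < 0x80) {
                sb.append((char) value);
            } else {
                sb.append(String.format("0x%02X", value));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xFF};                              // one non-ASCII byte
        byte[] literal = "0xFF".getBytes(StandardCharsets.US_ASCII); // four ASCII bytes

        // Two different inputs, one identical decoded string.
        System.out.println(decodeWithHexEscapes(invalid)); // 0xFF
        System.out.println(decodeWithHexEscapes(literal)); // 0xFF
    }
}
```

    Adding an escape character for the escape itself would remove the collision, but at that point you have designed a new binary-to-text encoding rather than decoded the data with an existing character encoding.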

    This leaves 8-bit encodings, where every sequence is just a single byte. The single byte is valid if there is a mapping for it. But many 8-bit encodings have holes and don't encode 256 different characters.
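
    One way to check whether a single-byte charset has holes is to count how many of the 256 byte values survive a decode-then-encode round trip. The sketch below is only an illustration: countRoundTrippingBytes is a hypothetical helper, and US-ASCII stands in for "a charset with holes" because it is guaranteed to exist in every Java runtime. Other charsets can be passed in via Charset.forName if your runtime provides them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SingleByteCoverage {
    // Hypothetical helper: count the byte values that come back unchanged
    // after being decoded to a string and encoded again with the same charset.
    static int countRoundTrippingBytes(Charset charset) {
        int count = 0;
        for (int value = 0; value < 256; value++) {
            byte[] original = {(byte) value};
            byte[] back = new String(original, charset).getBytes(charset);
            if (back.length == 1 && back[0] == original[0]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Every byte value has a mapping in ISO-8859-1.
        System.out.println(countRoundTrippingBytes(StandardCharsets.ISO_8859_1)); // 256
        // US-ASCII only maps 0x00..0x7F; the upper half is a hole.
        System.out.println(countRoundTrippingBytes(StandardCharsets.US_ASCII));   // 128
    }
}
```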

    To retain the original binary data, you need an 8-bit encoding that maps all 256 byte values to 256 different characters. ISO-8859-1 is not the only such encoding. What is unique about it is that each decoded code point's value equals the value of the byte it was decoded from.

    So if you have the decoded string and the encoded bytes, then it is always true that

    (byte)str.charAt(i) == bytes[i] 
    

    for arbitrary binary data where str is new String(bytes, "ISO-8859-1") and bytes is a byte[].
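
    A minimal runnable version of that check might look like the following. The class name Latin1RoundTrip and the method preserves are invented for this sketch, and it uses StandardCharsets.ISO_8859_1 (available since Java 7) rather than the "ISO-8859-1" string so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Hypothetical check: returns true if decoding as ISO-8859-1 keeps every
    // byte value as the code point at the same index, and encoding restores
    // the original array.
    static boolean preserves(byte[] bytes) {
        String str = new String(bytes, StandardCharsets.ISO_8859_1);
        if (str.length() != bytes.length) {
            return false;
        }
        for (int i = 0; i < bytes.length; i++) {
            if ((byte) str.charAt(i) != bytes[i]) {
                return false;
            }
        }
        return Arrays.equals(bytes, str.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // Use every possible byte value as the "arbitrary binary data".
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println(preserves(all)); // true
    }
}
```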


    It also has nothing to do with Java. I have no idea what his comment means; these are properties of character encodings, not of programming languages.
