Java Strings Character Encoding - For French - Dutch Locales

执念已碎 2021-01-13 16:26

I have the following piece of code:

public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(Charset.defaultCha         


        
3 Answers
  •  刺人心 (original poster)
    2021-01-13 17:18

    When you call String's getBytes() method without arguments, it does the following, according to the Javadoc:

    Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

    So whenever you do:

    accentedE.getBytes()
    

    it returns the contents of accentedE as bytes encoded in the platform's default charset, which in your case is cp1252 (windows-1252).
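    To check which charset the no-argument call picks up on a given machine, a minimal sketch follows. The class and method names (DefaultCharsetDemo, sameAsDefault) are made up for illustration, and é is written as a Unicode escape so the encoding of the source file does not matter:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetDemo {
    // True when the no-argument getBytes() produces the same bytes as an
    // explicit encode with the platform default charset.
    static boolean sameAsDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        String accentedE = "\u00E9"; // the character é
        // The value printed here depends on the JVM and OS configuration.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes() uses it: " + sameAsDefault(accentedE));
    }
}
```

    The first line of output varies from machine to machine, which is exactly why code that relies on the default behaves differently across locales.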

    This line:

    new String(accentedE.getBytes(), Charset.forName("UTF-8"))
    

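    takes the bytes of accentedE (encoded in cp1252) and tries to decode them as UTF-8, hence the error: the single cp1252 byte for é is not a valid UTF-8 sequence, so the decoder substitutes the replacement character (U+FFFD). The reverse mismatch happens with: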

    new String(accentedE.getBytes("utf-8"))
    

    Here getBytes("utf-8") encodes accentedE as UTF-8, but the String constructor then decodes those bytes with the platform's default charset, which is cp1252, so each of the two UTF-8 bytes turns into a separate character. The Javadoc for that constructor says:

    Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
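    Both mismatches can be reproduced on any machine by naming the charset on each side explicitly instead of relying on the default. This is a minimal sketch, assuming the JDK in use provides the windows-1252 charset (it is not one of the six charsets every JVM must support, although it is commonly available). The names RoundTripDemo, roundTrip and codePoints are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class RoundTripDemo {
    // Assumption: this charset is available in the running JDK.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encode with one charset, then decode the bytes with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    // Render a string as code points so the console encoding cannot distort the result.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String accentedE = "\u00E9"; // the character é

        // cp1252 bytes decoded as UTF-8: malformed input, replaced by U+FFFD
        System.out.println(codePoints(roundTrip(accentedE, CP1252, StandardCharsets.UTF_8)));
        // UTF-8 bytes decoded as cp1252: two characters instead of one
        System.out.println(codePoints(roundTrip(accentedE, StandardCharsets.UTF_8, CP1252)));
        // Same charset on both sides: the original character survives
        System.out.println(codePoints(roundTrip(accentedE, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
    }
}
```

    This prints U+FFFD, then U+00C3 U+00A9, then U+00E9. Only the last call, where the encoding and decoding charsets match, gives back é.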

    I strongly recommend reading this excellent article:

    The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

    UPDATE:

    In short, every character is stored as a number. A charset (code page) defines which byte values represent which character, and different charsets can use different bytes for the same character. Consider the following snippet:

    String accentedE = "é";
    
    System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[0]));
    System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[1]));
    System.out.println(String.format("%02X ", accentedE.getBytes("windows-1252")[0]));
    

    which outputs:

    C3 
    A9 
    E9
    

    That is because é (lowercase e with an acute accent) is encoded in UTF-8 as the two bytes C3 A9, while in cp1252 it is the single byte E9. For a detailed explanation, read the article mentioned above.
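    To separate the character's number from its byte representation, the sketch below prints the code point of é next to its encoding in several charsets. The class and helper names (EncodedBytesDemo, hex) are hypothetical, and the windows-1252 charset is assumed to be available in the running JDK:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedBytesDemo {
    // Format a byte array as space-separated, two-digit uppercase hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String accentedE = "\u00E9"; // the character é

        // The code point is the same regardless of charset.
        System.out.println(String.format("U+%04X", accentedE.codePointAt(0)));
        // The bytes depend on the charset chosen for encoding.
        System.out.println(hex(accentedE.getBytes(StandardCharsets.UTF_8)));
        System.out.println(hex(accentedE.getBytes(Charset.forName("windows-1252"))));
        System.out.println(hex(accentedE.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

    The output is U+00E9, C3 A9, E9 and 00 E9: one character number, three different byte sequences.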
