UTF-16 character encoding in Java

借酒劲吻你 2020-12-31 06:26

I was trying to understand character encoding in Java. Characters in Java are stored in 16 bits using UTF-16 encoding. So while I am converting a string containing 6 c

5 answers
  •  没有蜡笔的小新
    2020-12-31 07:02

    In the UTF-16 version, you get 14 bytes because a 2-byte byte-order mark (BOM) is inserted at the start to distinguish between Big Endian (the default) and Little Endian. If you specify UTF-16LE you will get 12 bytes (little-endian, no byte-order mark added).

    See http://www.unicode.org/faq/utf_bom.html#gen7
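    The byte counts above can be checked with a minimal sketch like the following. The class name, the helper methods, and the sample string "abcdef" are illustrative assumptions, not part of the original question; the only library calls used are String.getBytes(Charset) and Charset.forName.

```java
import java.nio.charset.Charset;

public class Utf16Lengths {
    // Number of bytes produced when encoding s with the named charset.
    public static int byteCount(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    // True if the encoded bytes start with the big-endian byte-order mark FE FF.
    public static boolean startsWithBigEndianBom(String s, String charsetName) {
        byte[] bs = s.getBytes(Charset.forName(charsetName));
        return bs.length >= 2
                && (bs[0] & 0xff) == 0xFE
                && (bs[1] & 0xff) == 0xFF;
    }

    public static void main(String[] args) {
        String s = "abcdef"; // hypothetical 6-character sample string

        System.out.println(byteCount(s, "UTF-16"));   // 2-byte BOM + 6 * 2 bytes
        System.out.println(byteCount(s, "UTF-16BE")); // 6 * 2 bytes, no BOM
        System.out.println(byteCount(s, "UTF-16LE")); // 6 * 2 bytes, no BOM
        System.out.println(startsWithBigEndianBom(s, "UTF-16"));
    }
}
```

    Passing an explicit UTF-16BE or UTF-16LE charset is the usual way to avoid the extra two bytes when the byte order is already agreed on by both sides.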


    EDIT - Use this program to look into the actual bytes generated by different encodings:

    public class Test {
        public static void main(String args[]) throws Exception {
            // bytes in the first argument, encoded using second argument
            byte[] bs = args[0].getBytes(args[1]);
            System.err.println(bs.length + " bytes:");
    
            // print hex values of bytes and (if printable), the char itself
            char[] hex = "0123456789ABCDEF".toCharArray();
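            for (int i=0; i<bs.length; i++) {
                // unsigned value of the byte (0..255), so it can index into hex
                int b = bs[i] & 0xff;
                System.err.print(hex[b>>4] + "" + hex[b&0xf]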
                    + ( ! Character.isISOControl((char)b) ? ""+(char)b : ".")
                    + ( (i%4 == 3) ? "\n" : " "));
            }
            System.err.println();   
        }
    }
    

    For example, when running under UTF-8 (under other JVM default encodings, the characters for FE and FF would show up differently), the output is:

    $ javac Test.java  && java -cp . Test hello UTF-16
    12 bytes:
    FEþ FFÿ 00. 68h
    00. 65e 00. 6Cl
    00. 6Cl 00. 6Fo
    

    And

    $ javac Test.java  && java -cp . Test hello UTF-16LE
    10 bytes:
    68h 00. 65e 00.
    6Cl 00. 6Cl 00.
    6Fo 00. 
    

    And

    $ javac Test.java  && java -cp . Test hello UTF-16BE
    10 bytes:
    00. 68h 00. 65e
    00. 6Cl 00. 6Cl
    00. 6Fo
    
