UTF-16 Character Encoding of java

后端未结

关注

 5  1168

借酒劲吻你 2020-12-31 06:26

I was trying to understand character encoding in Java. Characters in Java are being stored in 16 bits using UTF-16 encoding. So while i am converting a string containing 6 c

5条回答

没有蜡笔的小新 (楼主)

2020-12-31 07:02

In the UTF-16 version, you get 14 bytes because of a marker inserted to distinguish between Big Endian (default) and Little Endian. If you specify UTF-16LE you will get 12 bytes (little-endian, no byte-order marker added).

See http://www.unicode.org/faq/utf_bom.html#gen7

EDIT - Use this program to look into the actual bytes generated by different encodings:

public class Test {
    public static void main(String args[]) throws Exception {
        // bytes in the first argument, encoded using second argument
        byte[] bs = args[0].getBytes(args[1]);
        System.err.println(bs.length + " bytes:");

        // print hex values of bytes and (if printable), the char itself
        char[] hex = "0123456789ABCDEF".toCharArray();
        for (int i=0; i>4] + "" + hex[b&0xf] 
                + ( ! Character.isISOControl((char)b) ? ""+(char)b : ".")
                + ( (i%4 == 3) ? "\n" : " "));
        }
        System.err.println();   
    }
}

For example, when running under UTF-8 (under other JVM default encodings, the characters for FE and FF would show up different), the output is:

$ javac Test.java  && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo

And

$ javac Test.java  && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00.

And

$ javac Test.java  && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo

0 讨论(0)

查看其它5个回答