UTF-16 character encoding in Java

I was trying to understand character encoding in Java. Characters in Java are stored in 16 bits using UTF-16 encoding. So while I am converting a string containing 6 c

5 Answers

没有蜡笔的小新

2020-12-31 07:02

With UTF-16 you get 14 bytes because a 2-byte byte-order mark (BOM) is inserted at the start to distinguish big-endian (the default) from little-endian. If you specify UTF-16LE you will get 12 bytes (little-endian, no BOM added).

See http://www.unicode.org/faq/utf_bom.html#gen7
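
As a quick check of the byte counts described above, here is a minimal sketch. The class name `BomLength`, the helper `encodedLength`, and the sample string "abcdef" are made up for illustration; only `String.getBytes(Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomLength {
    // Number of bytes produced when s is encoded with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "abcdef"; // hypothetical 6-character sample string

        byte[] withBom = s.getBytes(StandardCharsets.UTF_16);
        // The first two bytes are the byte-order mark FE FF (big-endian).
        System.out.printf("UTF-16:   %d bytes, first two: %02X %02X%n",
                withBom.length, withBom[0] & 0xFF, withBom[1] & 0xFF);

        // The LE and BE variants fix the byte order, so no mark is written.
        System.out.println("UTF-16LE: " + encodedLength(s, StandardCharsets.UTF_16LE) + " bytes");
        System.out.println("UTF-16BE: " + encodedLength(s, StandardCharsets.UTF_16BE) + " bytes");
    }
}
```

For six ASCII letters this should report 14 bytes for UTF-16 and 12 bytes for each of the other two.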

EDIT - Use this program to look into the actual bytes generated by different encodings:

public class Test {
    public static void main(String args[]) throws Exception {
        // bytes in the first argument, encoded using second argument
        byte[] bs = args[0].getBytes(args[1]);
        System.err.println(bs.length + " bytes:");

        // print hex values of bytes and (if printable), the char itself
        char[] hex = "0123456789ABCDEF".toCharArray();
        for (int i=0; i<bs.length; i++) {
            int b = (bs[i] < 0) ? bs[i] + 256 : bs[i];
            System.err.print(hex[b>>4] + "" + hex[b&0xf] 
                + ( ! Character.isISOControl((char)b) ? ""+(char)b : ".")
                + ( (i%4 == 3) ? "\n" : " "));
        }
        System.err.println();   
    }
}

For example, when the JVM's default encoding is UTF-8 (under other default encodings, the characters for FE and FF would show up differently), the output is:

$ javac Test.java  && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo

And

$ javac Test.java  && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00.

And

$ javac Test.java  && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo

清酒与你

2020-12-31 07:07

I think this will help: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

And this will help as well: "UTF-16 (16-bit Unicode Transformation Format) is a character encoding [...] The encoding is a variable-length encoding as code points are encoded with one or two 16-bit code units." (from Wikipedia)
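
To make the "one or two 16-bit code units" part concrete, here is a small sketch. The class and helper names are invented for this example, and the two code points (U+00E9 and U+1F600) are arbitrary picks from inside and outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Build a string holding exactly one Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Bytes needed in UTF-16BE, which adds no byte-order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x00E9);   // one 16-bit code unit
        String supp = fromCodePoint(0x1F600); // needs a surrogate pair

        for (String s : new String[] {bmp, supp}) {
            System.out.printf("U+%04X: %d char(s), %d code point(s), %d bytes%n",
                    s.codePointAt(0), s.length(),
                    s.codePointCount(0, s.length()), utf16Bytes(s));
        }
    }
}
```

The second string has a length() of 2 even though it holds a single code point, which is why length() and the number of characters a user sees can differ.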

后悔当初

2020-12-31 07:11

To encode as UTF-16, use str.getBytes("UTF-16");

Note that this gives a byte[] of length 14 for the string in the question. See http://rosettacode.org/wiki/String_length for more details.

庸人自扰

2020-12-31 07:17

As per the String.getBytes() method's documentation, the string is encoded into a sequence of bytes using the platform's default charset.

I assume your platform's default charset is ISO-8859-1 (or a similar one-byte-per-character charset). These charsets encode each character as one byte.

If you want to specify the encoding, use the method String.getBytes(Charset) or String.getBytes(String).
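
A rough sketch of the two overloads side by side, assuming a 6-letter sample string. The class `ExplicitCharset` and its helpers are hypothetical names. The name-based overload declares the checked `UnsupportedEncodingException`, which this sketch wraps so the helper is easier to call.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // String.getBytes(Charset): no checked exception involved.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // String.getBytes(String): an unknown name raises a checked exception,
    // rethrown here as an unchecked one to keep callers simple.
    static byte[] encodeByName(String s, String charsetName) {
        try {
            return s.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String s = "abcdef"; // hypothetical sample string

        // The no-argument overload depends on the platform default.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + s.getBytes().length + " bytes");
        System.out.println("ISO-8859-1: "
                + encode(s, StandardCharsets.ISO_8859_1).length + " bytes");
        System.out.println("UTF-16 by name: "
                + encodeByName(s, "UTF-16").length + " bytes");
    }
}
```

Passing a `StandardCharsets` constant is usually the more convenient choice, since it avoids both the checked exception and typos in the charset name.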

About the 16-bit storage: this is how Java represents characters, and therefore strings, internally. It is based on the original Unicode specification.

野性不改

2020-12-31 07:17
String.getBytes() uses the platform's default encoding. Try this:
```
byte bt[] = str.getBytes("UTF-16");
```
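
For the reverse direction, here is a sketch of decoding such bytes back into a string. The class name `Utf16RoundTrip` and its helper methods are made up for illustration. It relies on the JDK's UTF-16 decoder reading the byte-order mark to pick the byte order and leaving it out of the result.

```java
import java.nio.charset.StandardCharsets;

public class Utf16RoundTrip {
    // Encode to UTF-16 (with byte-order mark) and decode again.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16);
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Decode raw bytes as UTF-16, letting a leading mark choose the byte order.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String s = "abcdef"; // hypothetical sample string
        System.out.println("round trip equal: " + roundTrip(s).equals(s));

        // FF FE marks little-endian, so 68 00 is read as 'h'.
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x68, 0x00};
        System.out.println("decoded: " + decode(littleEndian));
    }
}
```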