UTF-16 Character Encoding in Java

借酒劲吻你 2020-12-31 06:26

I was trying to understand character encoding in Java. Characters in Java are stored in 16 bits using UTF-16 encoding. So while I am converting a string containing 6 c

5 Answers
  • 2020-12-31 07:02

    In the UTF-16 version, you get 14 bytes because of a marker inserted to distinguish between Big Endian (default) and Little Endian. If you specify UTF-16LE you will get 12 bytes (little-endian, no byte-order marker added).

    See http://www.unicode.org/faq/utf_bom.html#gen7
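
The byte counts described above can be checked directly. The following is a minimal sketch, assuming a hypothetical 6-character sample string "abcdef" (the string from the question is not shown) and a hypothetical helper named encodedLength. It uses only the standard java.nio.charset.StandardCharsets constants.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Hypothetical helper: number of bytes produced when encoding s with cs.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "abcdef"; // assumed sample; any 6 characters from the BMP behave the same

        // "UTF-16" prepends a 2-byte byte-order mark (FE FF) before the big-endian data.
        System.out.println("UTF-16:   " + encodedLength(s, StandardCharsets.UTF_16));
        // The explicit-endianness variants write no byte-order mark.
        System.out.println("UTF-16LE: " + encodedLength(s, StandardCharsets.UTF_16LE));
        System.out.println("UTF-16BE: " + encodedLength(s, StandardCharsets.UTF_16BE));

        // Show the mark itself: the first two bytes of the "UTF-16" encoding.
        byte[] withBom = s.getBytes(StandardCharsets.UTF_16);
        System.out.printf("first two bytes: %02X %02X%n", withBom[0], withBom[1]);
    }
}
```

With this sample the three lengths should come out as 14, 12 and 12: two bytes per character, plus two for the byte-order mark in the plain "UTF-16" case.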


    EDIT - Use this program to look into the actual bytes generated by different encodings:

    public class Test {
        public static void main(String args[]) throws Exception {
            // bytes in the first argument, encoded using second argument
            byte[] bs = args[0].getBytes(args[1]);
            System.err.println(bs.length + " bytes:");
    
            // print hex values of bytes and (if printable), the char itself
            char[] hex = "0123456789ABCDEF".toCharArray();
            for (int i=0; i<bs.length; i++) {
                int b = bs[i] & 0xFF; // read the signed byte as an unsigned value (0..255)
                System.err.print(hex[b>>4] + "" + hex[b&0xf] 
                    + ( ! Character.isISOControl((char)b) ? ""+(char)b : ".")
                    + ( (i%4 == 3) ? "\n" : " "));
            }
            System.err.println();   
        }
    }
    

    For example, when running under UTF-8 (under other JVM default encodings, the characters for FE and FF would show up differently), the output is:

    $ javac Test.java  && java -cp . Test hello UTF-16
    12 bytes:
    FEþ FFÿ 00. 68h
    00. 65e 00. 6Cl
    00. 6Cl 00. 6Fo
    

    And

    $ javac Test.java  && java -cp . Test hello UTF-16LE
    10 bytes:
    68h 00. 65e 00.
    6Cl 00. 6Cl 00.
    6Fo 00. 
    

    And

    $ javac Test.java  && java -cp . Test hello UTF-16BE
    10 bytes:
    00. 68h 00. 65e
    00. 6Cl 00. 6Cl
    00. 6Fo
    
  • 2020-12-31 07:07

    I think this will help: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

    And this will help as well: "UTF-16 (16-bit Unicode Transformation Format) is a character encoding [...] The encoding is a variable-length encoding as code points are encoded with one or two 16-bit code units." (from Wikipedia)
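
The "one or two 16-bit code units" point in the quote can be seen from Java itself. This is a small sketch, not part of the original answer; the class name SurrogateDemo, the helper codeUnits and the sample code point U+1F600 are chosen here purely for illustration.

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Hypothetical helper: how many 16-bit code units UTF-16 needs for one code point.
    static int codeUnits(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String bmp = "h";                                       // U+0068, inside the Basic Multilingual Plane
        String astral = new String(Character.toChars(0x1F600)); // U+1F600, outside it (surrogate pair)

        // String.length() counts 16-bit code units, not code points.
        System.out.println(bmp.length() + " code unit(s), "
                + bmp.codePointCount(0, bmp.length()) + " code point(s)");
        System.out.println(astral.length() + " code unit(s), "
                + astral.codePointCount(0, astral.length()) + " code point(s)");

        // Encoded without a byte-order mark: 2 bytes per code unit.
        System.out.println(bmp.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
        System.out.println(astral.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
    }
}
```

So a Java char is always 16 bits, but a single character outside the Basic Multilingual Plane occupies two of them.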

  • 2020-12-31 07:11

    For UTF-16 encoding, use str.getBytes("UTF-16");

    but it gives a byte[] of length 14. Please refer to http://rosettacode.org/wiki/String_length for more details.

  • 2020-12-31 07:17

    As per the String.getBytes() method's documentation, the string is encoded into a sequence of bytes using the platform's default charset.

    I assume your platform's default charset is ISO-8859-1 (or a similar one-byte-per-character charset). These charsets encode each character into one byte.

    If you want to specify the encoding, use the method String.getBytes(Charset) or String.getBytes(String).

    About the 16-bit storage: this is how Java represents characters, and therefore strings, internally. It is based on the original Unicode specification, which defined characters as fixed-width 16-bit values.
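
To make the difference between the default charset and an explicit one concrete, here is a hedged sketch. The class name CharsetDemo, the helper byteCount and the sample string "hello" are assumptions made for this example; the default-charset line will print whatever your platform uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Hypothetical helper: byte count of s when encoded with an explicit charset.
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "hello"; // assumed sample string

        // No argument: the platform's default charset is used, so the result can vary by machine.
        System.out.println("default (" + Charset.defaultCharset() + "): " + s.getBytes().length);

        // Explicit charsets give the same result everywhere.
        System.out.println("ISO-8859-1: " + byteCount(s, StandardCharsets.ISO_8859_1)); // 1 byte per char
        System.out.println("UTF-16BE:   " + byteCount(s, StandardCharsets.UTF_16BE));   // 2 bytes per char
        System.out.println("UTF-16:     " + byteCount(s, StandardCharsets.UTF_16));     // 2 per char + 2-byte BOM
    }
}
```

Passing a Charset constant rather than a charset name also avoids the checked UnsupportedEncodingException that String.getBytes(String) declares.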

  • 2020-12-31 07:17

    String.getBytes() uses the platform's default encoding. Try this:

    byte bt[] = str.getBytes("UTF-16");
    