UTF Encoding for Chinese Characters (Java)

六月ゝ 毕业季﹏ submitted on 2019-11-29 04:28:41

You have what is known as a double encoding.

You have the three-character sequence "你好吗", which you correctly point out is encoded in UTF-8 as E4 BD A0, E5 A5 BD, E5 90 97 (three bytes per character).

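But now, take each byte of THAT encoding, treat it as a character in its own right, and encode it in UTF-8 again. Start with E4. Read as a character, it is code point U+00E4 (ä). What is that code point in UTF-8? Try it! It's C3 A4!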

You get the idea.... :-)
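If you want to try that single step yourself, here is a minimal sketch. The class name SingleByteStep and the helper reencode are made up for illustration; it simply treats a byte value as the code point with the same number and prints the UTF-8 bytes of that code point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
public class SingleByteStep {
    // Treat a byte value (0x00-0xFF) as the code point with the same number,
    // encode that code point as UTF-8, and return the result as hex.
    static String reencode(int byteValue) {
        String asChar = new String(Character.toChars(byteValue & 0xFF));
        StringBuilder hex = new StringBuilder();
        for (byte b : asChar.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X ", b));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        // The three bytes of the first character, re-encoded one by one.
        System.out.println(reencode(0xE4)); // C3 A4
        System.out.println(reencode(0xBD)); // C2 BD
        System.out.println(reencode(0xA0)); // C2 A0
    }
}
```

Every byte of 0x80 or above turns into two bytes, which is why the double-encoded text ends up twice as long as the correct encoding.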

Here is a Java app which illustrates this:

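import java.nio.charset.StandardCharsets;

public class DoubleEncoding {
    public static void main(String[] args) {
        // First (correct) encoding: e4 bd a0 e5 a5 bd e5 90 97
        byte[] encoding1 = "你好吗".getBytes(StandardCharsets.UTF_8);
        printHex(encoding1);

        // The mistake: read those bytes as ISO-8859-1, so each byte becomes one character
        String string1 = new String(encoding1, StandardCharsets.ISO_8859_1);

        // Second encoding: each of those characters now takes two bytes (c3 a4 c2 bd c2 a0 ...)
        byte[] encoding2 = string1.getBytes(StandardCharsets.UTF_8);
        printHex(encoding2);
    }

    // Print each byte as two lowercase hex digits, followed by a newline
    private static void printHex(byte[] bytes) {
        for (byte b : bytes) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}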
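
Going the other way may also be useful. The sketch below assumes the damage happened exactly as described above, that is, the UTF-8 bytes were misread as ISO-8859-1 and not as some other charset such as windows-1252. The class DoubleEncodingRepair and its methods doubleEncode and repair are hypothetical names for illustration. The string literal is written with Unicode escapes for 你好吗 so that the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
public class DoubleEncodingRepair {
    // Reproduce the mistake: UTF-8 bytes misread as ISO-8859-1 characters.
    static String doubleEncode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undo it: map each character back to its single byte, then decode as UTF-8.
    // Assumption: the misreading used ISO-8859-1, which maps bytes 0x00-0xFF one to one.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4F60\u597D\u5417"; // U+4F60 U+597D U+5417
        String garbled = doubleEncode(original);
        System.out.println(garbled.length());                  // 9
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```

This only repairs text in memory. The better fix is usually to find the place where the bytes are decoded with the wrong charset and correct it there.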
