UTF-8 Encoding: Only some Japanese characters are not getting converted

Submitted by 孤街浪徒 on 2019-11-27 06:09:58

Question


I am receiving a parameter value from a Jersey web service, and the value consists of Japanese characters.

Here, 'japaneseString' is the web service parameter containing the Japanese characters.

   String name = new String(japaneseString.getBytes(), "UTF-8");

However, I am able to convert a few string literals successfully, while some of them cause problems.

The following were successfully converted:

 1) アップル
 2) 赤
 3) 世丕且且世两上与丑万丣丕且丗丕
 4) 世世丗丈

While these didn't:

 1) ひほわれよう
 2) 存在する

When I investigated further, I found that these two strings are converted into junk characters.

 1) Input: ひほわれよう        Output : �?��?��?れよ�?�
 2) Input: 存在する            Output: 存在�?�る

Any idea why some of the Japanese characters are not converted properly?

Thanks.


Answer 1:


Try setting the JVM parameter file.encoding to UTF-8 when Tomcat (the JVM) starts, e.g.: -Dfile.encoding=UTF-8
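To see what the JVM is actually using before and after adding that flag, a small check like the following may help. It is only a diagnostic sketch; the class and method names are made up for illustration. Note that on JDK 18 and later the default charset is already UTF-8 unless overridden.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports the charset that the no-argument
// getBytes() and new String(byte[]) overloads will fall back to.
class DefaultCharsetCheck {
    static String describeDefaults() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
    }
}
```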




Answer 2:


You are mixing concepts here.

A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)

What you are doing here:

new String(x.getBytes(), "UTF-8")

is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.

If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.

Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you are using a Reader with the wrong charset. Fix that part.
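As a minimal sketch of "fixing that part", assuming the incoming bytes really are UTF-8, the Reader can be given its charset explicitly. The class and method names are illustrative, and a ByteArrayInputStream stands in for whatever stream the web service actually provides.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a byte stream with an explicit charset
// instead of relying on the platform default.
class ExplicitCharsetReader {
    static String readAllAsUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u5b58\u5728\u3059\u308b" is 存在する, written with escapes so the
        // example does not depend on the source file's own encoding.
        String original = "\u5b58\u5728\u3059\u308b";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        String decoded = readAllAsUtf8(new ByteArrayInputStream(wire));
        System.out.println(decoded.equals(original));
    }
}
```

Once the string has been decoded this way, no further getBytes()/new String(...) step is needed.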

For more information, read this link.

(1) the fact that a char is actually a UTF-16 code unit is irrelevant to this discussion




Answer 3:


I concur with @fge.

Clarification

In Java, String/char/Reader/Writer handle (Unicode) text and can combine all the scripts in the world.

byte[]/InputStream/OutputStream, on the other hand, hold binary data, which needs an indication of its encoding before it can be converted to a String.

In your case japaneseString should already be a correct String, or be substituted by the original byte[].

Traps in Java

The encoding is often an optional parameter, which then defaults to the platform encoding. You fell into that trap too:

String s = "...";
byte[] b1 = s.getBytes(); // Platform encoding, non-portable.
byte[] b2 = s.getBytes("UTF-8"); // Explicit, but throws the checked UnsupportedEncodingException
byte[] b3 = s.getBytes(StandardCharsets.UTF_8); // Explicit,
                         //  better (constants exist for UTF-8, ISO-8859-1, ...)

In general, avoid the overloaded methods without an encoding parameter, as they are only suitable for data local to the current computer: non-portable. For completeness: the classes FileReader/FileWriter should be avoided too, as before Java 11 they offered no encoding parameter at all.

Error

japaneseString is already wrong, so you have to read it correctly in the first place. It could have been read erroneously as Windows-1252 (Windows Latin-1) and suffered when recoding to UTF-8. Evidently only some cases get messed up.
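Why only some characters suffer can be reproduced with a small sketch. It assumes, as guessed above, that the wrong charset was windows-1252; the class and method names are made up, and the real cause in the asker's setup may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction: UTF-8 bytes are misread with the wrong charset,
// then "repaired" the way the question does it.
class MojibakeRoundTrip {
    static String misreadThenRecode(String original, Charset wrong) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // bytes as sent
        String misread = new String(wire, wrong);                // the faulty read
        byte[] recoded = misread.getBytes(wrong);                // like getBytes() with that default
        return new String(recoded, StandardCharsets.UTF_8);      // the question's conversion
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available, as it is on standard JDK builds.
        Charset cp1252 = Charset.forName("windows-1252");
        String reYo = "\u308c\u3088"; // れよ: UTF-8 bytes E3 82 8C, E3 82 88
        String hi = "\u3072";         // ひ:   UTF-8 bytes E3 81 B2
        System.out.println(misreadThenRecode(reYo, cp1252).equals(reYo));
        System.out.println(misreadThenRecode(hi, cp1252).equals(hi));
    }
}
```

Windows-1252 leaves a handful of byte values undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D). A character whose UTF-8 form contains one of them loses that byte on the way through, while characters made only of defined bytes survive the round trip. That matches the question: ひ, ほ, わ, う and す each contain 0x81 or 0x8F, whereas れ, よ and る do not.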

Maybe you had:

String japaneseString = new String(bytes);

instead of:

String japaneseString = new String(bytes, StandardCharsets.UTF_8);

At the end:

String name = japaneseString;

Show the code for reading japaneseString for further help.



Source: https://stackoverflow.com/questions/24009119/utf-8-encoding-only-some-japanese-characters-are-not-getting-converted
