Parsing Chinese characters in Java shows weird behaviour

╄→尐↘猪︶ㄣ submitted on 2020-01-05 08:27:23

Question


I have a CSV file in which some fields contain Chinese character strings. Unfortunately, I don't know the encoding of this input file. I am trying to read it and, using selected fields from it, produce an HTML file and another CSV file as output.

While reading the input CSV, I tried every encoding in the list at http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html that mentions Chinese in its description, and found that if I use

InputStreamReader read = new InputStreamReader(filepath,"GB18030");

for reading the CSV and

OutputStreamWriter osW=new OutputStreamWriter(objBufferedOutputStream,"UTF-16");

for writing the HTML and CSV, my output doesn't show weird characters.

But there are two problems:

  1. The output shows strings that are altogether different from the input! Even though my code does no processing on any string, the output strings are not found in any field of the input CSV.

For example, my input has the Chinese string 陈真珍 in field number 8, but my output HTML has something like 闄堢湡鐝� for that same field.

  2. As you can see, there is a question mark, i.e. the Unicode replacement character, at the end of the output 闄堢湡鐝�

Could you please help me trace where the mistake might be?

PS: Also, I checked Google Translate and found that the input string 陈真珍 means something like Chen Zhen Zhen, while the corresponding output string 闄堢湡鐝� means something like Yaobaoyujue. So the meaning differs as well as the representation of the characters.


Answer 1:


That output means that your input is NOT in GB18030 encoding.
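A minimal Java sketch of one way this kind of output can arise, assuming (the question does not confirm this) that the file is actually UTF-8. In UTF-8, 陈真珍 occupies nine bytes; a GB18030 reader consumes them in two-byte pairs, so four unrelated characters come out and one stray byte is left over, which is decoded as the replacement character. The class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes `text` as UTF-8, then decodes those bytes with another charset,
    // mimicking a reader that was opened with the wrong encoding.
    static String misdecode(String text, String wrongCharsetName) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName(wrongCharsetName));
    }

    public static void main(String[] args) {
        // 陈真珍, written with Unicode escapes so the source file's own
        // encoding cannot interfere with the experiment.
        String original = "\u9648\u771F\u73CD";
        String garbled = misdecode(original, "GB18030");
        System.out.println(garbled);                  // compare with the HTML output
        System.out.println(garbled.equals(original)); // false
    }
}
```

If the printed string matches what appears in the HTML output, the input file is very likely UTF-8 and should be read with "UTF-8" rather than "GB18030".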

Also: please check and double-check how you view your files: what encoding does the program use that opens the files, specifically the input file. Usually text files (and CSV files) don't come with metadata attached to them that shows their encoding, so the editors have to guess and that guess can easily be wrong.
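Rather than trusting what an editor guesses, candidate encodings can be tested against the raw bytes. The sketch below is one possible approach, not the only one: it uses a strict `CharsetDecoder` that reports malformed input instead of silently substituting replacement characters. The helper name `decodesCleanly` is hypothetical. A clean decode does not prove the candidate is correct, since some legacy encodings accept almost any byte sequence, but a failure does rule the candidate out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictDecodeCheck {
    // Returns true only if every byte sequence in `data` is valid in `charset`.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of emitting U+FFFD
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

In practice the bytes would come from the input file (for example via `java.nio.file.Files.readAllBytes`), and the check would be run for each candidate such as UTF-8, GBK, GB18030 and Big5.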




Answer 2:


Keep the encoding consistent when reading and writing Chinese characters, because not every encoding can represent every Chinese character (GBK, for example, covers fewer characters than GB18030 or UTF-8).

You can try using the UTF-8 encoding to handle Chinese characters.
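As a rough illustration of keeping encodings explicit on both sides, the sketch below decodes bytes with the charset they were really written in and re-encodes them with a chosen output charset, using the same `InputStreamReader` and `OutputStreamWriter` classes as the question. It works on in-memory byte arrays to stay self-contained; the class and method names are made up for this example, and with real files the streams would be `FileInputStream` and `FileOutputStream` instead.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

class ConsistentEncodingDemo {
    // Decodes `input` using `inputCharset` and re-encodes the text using `outputCharset`.
    static byte[] transcode(byte[] input, Charset inputCharset, Charset outputCharset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Reader reader = new BufferedReader(
                 new InputStreamReader(new ByteArrayInputStream(input), inputCharset));
             Writer writer = new BufferedWriter(
                 new OutputStreamWriter(sink, outputCharset))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and therefore flushed) at this point.
        return sink.toByteArray();
    }
}
```

Whatever charset is chosen for the output, the HTML page should declare the same one (for example in a `<meta charset="...">` tag), otherwise the browser has to guess, just as a text editor does.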



Source: https://stackoverflow.com/questions/19654146/parsing-chinese-characters-in-java-showing-weird-behaviour
