Thai script seems to lose UTF-8 encoding in java for-each loop

Submitted by 被刻印的时光 ゝ on 2021-01-29 08:49:25

Question


I'm trying to develop an application within Android Studio on Windows 10.

PROBLEM: The following string array of Thai words:

String[] myTHarr = {"มาก","เชี่ยว","แน่","ม่อน","บ้าน","พูด","เลื่อย","เมื่อ","ช่ำ","แร่"};

...when processed by the following for-each loop:

for (String s:myTHarr){
  //s = มา� before executing any of the below code:
  byte[] utf8EncodedThaiArr = s.getBytes("UTF-8"); 
  String utf8EncodedThai = new String(utf8EncodedThaiArr); //setting breakpoint here
  // s is still มาà¸�     (I want it to be มาก)
  //do stuff
}

results in s = มา� when attempting to process the first word (none of the other words work either, but that's expected given the first fails).

The Thai script appears in the string array correctly (the declaration was copied straight from Android Studio), the file encoding is set to UTF-8 for the Java file, and the File Encoding settings were shown in a screenshot that is not reproduced here.


Answer 1:


According to the documentation, the String(byte[]) constructor "Constructs a new String by decoding the specified array of bytes using the platform's default charset."

I'm guessing that the default character set is not UTF-8. So the solution is to specify the encoding for the array of bytes.

String utf8EncodedThai = new String(utf8EncodedThaiArr, "UTF-8"); //setting breakpoint here
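As a minimal sketch of the same idea, the round trip below names the charset on both the encode and the decode side, so the platform default never comes into play. The class and method names are made up for illustration, and the Thai word is written with Unicode escapes so that the example does not depend on how the source file itself is encoded.

```java
import java.nio.charset.StandardCharsets;

public class ThaiRoundTrip {
    // "มาก" as Unicode escapes (U+0E21, U+0E32, U+0E01), independent of source-file encoding.
    static final String WORD = "\u0E21\u0E32\u0E01";

    // Encode to UTF-8 and decode with the same explicit charset.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String back = roundTrip(WORD);
        System.out.println("round trip preserved the word: " + back.equals(WORD));
    }
}
```

Using the StandardCharsets constant instead of the string "UTF-8" also avoids the checked UnsupportedEncodingException that getBytes(String) declares.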



Answer 2:


As several people pointed out in the comments, the problem had to be within my environment. After a bit more searching I found I should have rebuilt the project after changing the encodings (so merely switching to UTF-8 and clicking 'Apply'/'OK' wasn't enough). For reference, my File Encoding settings were shown in a screenshot that is not reproduced here.

Once I rebuilt, I started getting the compiler error "unmappable character for encoding cp1252" on the String array containing the Thai (side note: Some of the Thai characters were fine, others rendered as � and friends. I would have thought either all of the Thai would work or none of it, but was surprised to see even common Thai letters such as ก cause the compiler to choke).
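One plausible explanation for why only some Thai letters fail, offered as a sketch rather than a confirmed diagnosis: cp1252 (windows-1252) leaves a handful of byte values undefined, including 0x81, and the UTF-8 encoding of ก (U+0E01) is E0 B8 81. Letters whose UTF-8 bytes all happen to be defined in cp1252 are silently mis-decoded instead of rejected. The class and method names below are hypothetical, and the example assumes the JVM ships the windows-1252 charset, which standard desktop JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Demo {
    // Encode as UTF-8, then decode the bytes as windows-1252,
    // mimicking a tool that reads a UTF-8 file with the wrong charset.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // True if the wrong decoding hit a byte cp1252 cannot map (shown as U+FFFD).
    static boolean hitsUnmappableByte(String s) {
        return misdecode(s).indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // U+0E01 (ก) encodes to E0 B8 81; 0x81 is undefined in cp1252.
        System.out.println("U+0E01 unmappable: " + hitsUnmappableByte("\u0E01"));
        // U+0E21 (ม) encodes to E0 B8 A1; every byte is defined in cp1252.
        System.out.println("U+0E21 unmappable: " + hitsUnmappableByte("\u0E21"));
    }
}
```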

That error led to this post, in which I tried a few things to set the compiler options to UTF-8. Since my application is a sort of 'pre-process' for an Android app, and is therefore separate from the app itself (if that makes any sense), I didn't have the luxury of using the compilerOptions attribute as the answers in the aforementioned SO post recommended (though I have since added it to the Gradle build on the Android app side). This led me to set the environment variable JAVA_TOOL_OPTIONS via PowerShell:

setx JAVA_TOOL_OPTIONS "-Dfile.encoding=UTF8"

Which fixed the issue!
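To check whether such a setting actually reached the JVM, a small diagnostic along these lines can be run from the same environment. This is only a sketch with a hypothetical class name; it reports what the running JVM sees and does not change anything. Note that setx only affects processes started afterwards, so the IDE or terminal needs to be restarted first.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Summarise the charset the JVM uses when no charset is given explicitly.
    static String report() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If the reported default is still windows-1252 (cp1252), the variable was most likely not picked up by the process that launched the JVM.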




Answer 3:


I tried your code with the attached settings, and the code worked fine.



Source: https://stackoverflow.com/questions/63580725/thai-script-seems-to-lose-utf-8-encoding-in-java-for-each-loop
