What character encoding is this?

╄→гoц情女王★ submitted on 2021-02-10 13:24:53

Question


I'm interfacing with an Oracle DB that has a messed-up encoding (ASCII7 according to the DB properties, but it actually stores Korean characters).

When I get some of the Korean strings from the resultSet, and look at the bytes, it turns out that they correspond exactly to this file (I found by googling some of the byte sequences): http://211.115.85.9/files/raw3.txt

Kinda spooky, as it seems to be the ONLY thing on the internet that has anything about this particular encoding...

The file, when viewed with EditPlus3, shows me 3 columns.

The first column is an alphabetical listing of Korean characters. The second is the strange encoding I'm finding when I look at the Java strings passed from the Oracle DB. The third one is UTF-8.

I'm trying to figure out what the middle column is encoded in. Can anyone point me in the right direction?

(I really don't want to have to actually read from this file every time I need to call a DB...)


Answer 1:


It is EUC-KR (or similarly) encoded data that was misinterpreted as a single-byte encoding (ISO-8859-1 or similar) and then encoded as UTF-8.

In other words: it's ill-encoded data, but it might be salvageable:

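// Middle-column bytes for the first entry in the file (c2b0c2a1)
byte[] bytes = new byte[] { (byte) 0xc2, (byte) 0xb0, (byte) 0xc2, (byte) 0xa1 };
String str = new String(bytes, "UTF-8");   // undo the UTF-8 step: U+00B0 U+00A1
bytes = str.getBytes("ISO-8859-1");        // recover the original bytes 0xb0 0xa1
str = new String(bytes, "EUC-KR");         // decode those bytes as EUC-KR
System.out.println(str);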

This prints 가 on my system.
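
The same repair can be wrapped in a small helper for strings coming out of the DB. This is only a minimal sketch: the class and method names are hypothetical, and it assumes the data really went through the EUC-KR, then ISO-8859-1, then UTF-8 chain described above. It also assumes your JRE ships the EUC-KR charset, which is not one of the charsets every Java runtime is required to provide.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative only.
final class EucKrRepair {
    // Throws UnsupportedCharsetException on runtimes without EUC-KR.
    private static final Charset EUC_KR = Charset.forName("EUC-KR");

    private EucKrRepair() {}

    // Repairs a String in which every char holds one raw EUC-KR byte,
    // i.e. the original bytes were decoded as ISO-8859-1.
    // Chars above U+00FF cannot be mapped back and become '?'.
    static String fromLatin1String(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, EUC_KR);
    }

    // Repairs the doubly encoded bytes (the middle column of the file).
    static String fromUtf8Bytes(byte[] doubleEncoded) {
        String garbled = new String(doubleEncoded, StandardCharsets.UTF_8);
        return fromLatin1String(garbled);
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xc2, (byte) 0xb0, (byte) 0xc2, (byte) 0xa1 };
        System.out.println(fromUtf8Bytes(bytes));
        System.out.println(fromLatin1String("\u00b0\u00a1"));
    }
}
```

Which of the two methods applies depends on what the JDBC driver hands back: if the Java string already contains the characters U+00B0 U+00A1, only the ISO-8859-1 step is needed.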

I've found this PDF file, which explains the problem (and how it happened) in more detail.




Answer 2:


It is UTF-8 encoding:

가 c2b0c2a1 eab080
각 c2b0c2a2 eab081
간 c2b0c2a3 eab084
갇 c2b0c2a4 eab087
...

I don't know the meaning of the middle column, but the third column is the hex representation of the UTF-8 bytes of the Hangul in the first column.

View the file with a hex editor; this may help.
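
The claim about the third column is easy to check. Here is a minimal sketch (the class and method names are made up for illustration) that prints the UTF-8 bytes of a string as lowercase hex, so the output can be compared with the third column of the listing above:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for checking the third column of the file.
final class Utf8Hex {
    private Utf8Hex() {}

    // Returns the UTF-8 encoding of the text as lowercase hex, e.g. "eab080".
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative bytes do not print as ffffffxx.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The first four Hangul syllables of the listing, written as
        // escapes (U+AC00, U+AC01, U+AC04, U+AC07) to avoid any
        // dependency on the source file encoding.
        String[] syllables = { "\uac00", "\uac01", "\uac04", "\uac07" };
        for (String s : syllables) {
            System.out.println(s + " " + utf8Hex(s));
        }
    }
}
```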

Good luck! :)




Answer 3:


I wrote a little script and decoded the middle column of the first two lines by brute force.

The following four results are Hangul, but I'm not sure if they make sense:

utf_16_be => 슰슡 슰슢
johab => 춿춰 춿춱
euc_kr => 째징 째짖
cp949 => 째징 째짖
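
A brute-force pass like this could look roughly as follows in Java. This is a sketch, not the script that produced the results above: the class and method names are hypothetical, and the charset names `x-Johab` and `x-windows-949` are the Java names I am assuming correspond to `johab` and `cp949`. Charsets the runtime does not provide are skipped.

```java
import java.nio.charset.Charset;

// Hypothetical brute-force decoder; not the original script.
final class BruteForceDecode {
    private BruteForceDecode() {}

    // Converts a hex string such as "c2b0c2a1" into its bytes.
    static byte[] parseHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decodes the bytes with the named charset, or returns null if the
    // runtime does not support it. Malformed input is not rejected:
    // the String constructor silently substitutes U+FFFD.
    static String decode(byte[] bytes, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Candidate names are assumptions; extend the list as needed.
        String[] candidates = { "UTF-16BE", "x-Johab", "EUC-KR", "x-windows-949" };
        String[] middleColumn = { "c2b0c2a1", "c2b0c2a2" };
        for (String name : candidates) {
            StringBuilder line = new StringBuilder(name).append(" =>");
            for (String hex : middleColumn) {
                String decoded = decode(parseHex(hex), name);
                line.append(' ').append(decoded == null ? "(unsupported)" : decoded);
            }
            System.out.println(line);
        }
    }
}
```

Getting Hangul back from a single decoding pass does not mean the charset is right; it only shows that the bytes happen to be valid in that charset.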

I hope that helped. Have a nice day! :)



Source: https://stackoverflow.com/questions/5854490/what-character-encoding-is-this
