Java: How to verify that Thai characters are encoded correctly from UTF-8 to TIS620

Question

I receive an input string in UTF-8, apply TIS620 encoding, and create a new string from the result. How do I retain the byte size? UTF-8 represents a Thai character in 3 bytes, whereas TIS620 uses 1 byte. My requirement is that the backend system stores each character of a string as 1 byte only, so the default UTF-8 breaks it.

How do I convert a String's character encoding from UTF-8 to TIS620?
How do I retain the byte size while passing it to the backend system?
If the string is reassigned to a new String, is the character encoding retained, or is it converted back to UTF-16 (Java's default)?
Is this possible in Java? Is there any library or utility that can be integrated?

I've tried the code below and can confirm that after TIS620 encoding the byte count matches the character count, i.e. 1 byte/char. But if encodedString is assigned to a new String, will it lose the TIS620 format?

(Convert a String with UTF-8 encoding to TIS620 (Thai encoding) in Java. What are the ways to do it, and is there any data loss?)

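import java.io.UnsupportedEncodingException;

public String encode() {
    try {
        String input = "ใบใบใบใบ";
        // Encode the (UTF-16) String into TIS620 bytes: 1 byte per Thai character
        byte[] encodedBytes = input.getBytes("TIS620");
        // Decoding those bytes yields an ordinary UTF-16 String again
        String encodedString = new String(encodedBytes, "TIS620");
        return encodedString;
    } catch (UnsupportedEncodingException e) {
        // The JVM does not support the TIS620 charset
        return null;
    }
}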

Expected result: if I convert 5 Thai characters from UTF-8 to TIS620, the byte count should drop from 15 (UTF-8) to 5 (TIS620) and stay that way.

Answer 1:

Java's text datatypes (String, char and Character) always use the UTF-16 character encoding of the Unicode character set. The same goes for .NET, JavaScript, VB4/5/6/A/Script, …

Many interfaces, bindings, drivers, data adaptors, and whatnot understand that the text datatype is UTF-16, know which character encoding the target needs, and so do the conversion themselves. As long as you are using Java datatypes, if you have text encoded as UTF-8 or TIS620, you would typically hold it in a byte array.

That's it for straightforward text as text.

Now, if you had an array of arbitrary bytes and wanted to write it into a text context, you could use Base64. Such a function takes a byte array and returns a String (UTF-16 encoded, of course). Since the characters Base64 uses are supported by practically every character set, converting that String with whichever character encoding is needed loses no data.
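As a minimal sketch of that idea, java.util.Base64 can wrap arbitrary bytes in a String and recover them unchanged. The class name, method names, and sample bytes are illustrative and not part of the original answer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64Sketch {
    // Wrap arbitrary bytes in an ASCII-only String.
    static String wrap(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recover the original bytes from the Base64 text.
    static byte[] unwrap(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Hypothetical payload: any bytes at all, not necessarily valid text.
        byte[] raw = {(byte) 0xE3, (byte) 0xBA, 0x00, (byte) 0xFF};
        String text = wrap(raw);

        // Push the text through a byte-oriented channel using plain ASCII.
        byte[] onTheWire = text.getBytes(StandardCharsets.US_ASCII);
        byte[] back = unwrap(new String(onTheWire, StandardCharsets.US_ASCII));

        System.out.println(text + " round trip ok: " + Arrays.equals(raw, back));
    }
}
```

Because the Base64 alphabet is plain ASCII, the wrapped text should pass through any ASCII-compatible encoding intact.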

People do like dealing with text datatypes so the above scheme is great. But for some reason, instead of Base64, some people use what I call Base256. They have an array of bytes (very often created from encoding text with a character encoding) and they apply an encoding function to convert the bytes to text, choosing to encode by decoding with a character encoding. You need to identify if that's what you are dealing with and if so, which character encoding was co-opted as a Base256 encoding. (Often the character encoding used for this is ISO 8859-1.)
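To make the "Base256" pattern concrete, here is a small sketch. It assumes ISO 8859-1 is the co-opted encoding and that the JVM provides the TIS-620 charset; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Base256Sketch {
    // "Encode" bytes as text by decoding them as ISO 8859-1:
    // every byte value 0x00..0xFF becomes the char U+0000..U+00FF.
    static String bytesToCarrier(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Reverse the trick: each char U+0000..U+00FF becomes one byte again.
    static byte[] carrierToBytes(String carrier) {
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // TIS-620 bytes for the two Thai characters U+0E43 U+0E1A.
        byte[] tis620 = {(byte) 0xE3, (byte) 0xBA};

        // The carrier String holds U+00E3 U+00BA: not Thai, just a container.
        String carrier = bytesToCarrier(tis620);
        System.out.println("carrier length: " + carrier.length());

        // To get the real text, recover the bytes and decode with the true encoding.
        byte[] recovered = carrierToBytes(carrier);
        String thai = new String(recovered, Charset.forName("TIS-620"));
        System.out.println("bytes intact: " + Arrays.equals(tis620, recovered));
        System.out.println("is Thai text: " + thai.equals("\u0E43\u0E1A"));
    }
}
```

The carrier String only makes sense to code that knows which encoding was co-opted; treated as ordinary text it is mojibake.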

Answer 2:

A String in Java is always encoded in UTF-16, no matter how it was constructed. Or put differently: as soon as you have a String object, you should not care about which encoding it has. The encoding only comes back into the picture once you want to go back towards a byte[] (or OutputStream or the like).

This is correct and almost certainly exactly what you want to do. You should not try to work around that fact.

If you need to write the string to disk or send it to some other system in some specific encoding then you can get that encoded data from the String by using getBytes() as you did in your sample code.
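A rough sketch of that step, using the question's sample text written with Unicode escapes so the source file's own encoding does not matter. It assumes the JVM ships the TIS-620 charset (it is one of the extended charsets in standard JDK builds); the class and method names are made up for illustration. The canEncode check is an addition here: getBytes silently substitutes a replacement byte for characters the target charset cannot represent, so checking first is one way to detect data loss.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Tis620Bytes {
    // Assumption: this charset is available in the running JVM.
    static final Charset TIS620 = Charset.forName("TIS-620");

    // Encode to TIS-620, refusing text that would lose data.
    static byte[] toTis620(String text) {
        if (!TIS620.newEncoder().canEncode(text)) {
            throw new IllegalArgumentException("text contains characters outside TIS-620");
        }
        return text.getBytes(TIS620);
    }

    public static void main(String[] args) {
        // The question's sample: U+0E43 U+0E1A repeated four times.
        String input = "\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A";

        System.out.println("chars:         " + input.length());
        System.out.println("UTF-8 bytes:   " + input.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("TIS-620 bytes: " + toTis620(input).length);
    }
}
```

For this eight-character sample the counts are 8 chars, 24 UTF-8 bytes, and 8 TIS-620 bytes. The byte array, not a String, is what should be handed to the backend.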

In other words:

A String object in Java cannot "have TIS620" encoding. A byte[] can contain TIS620 encoded data, and you create that from a String using .getBytes("TIS620").
If you pass the encoded byte[] to the other system, it will have the correct byte size, simply because it was created with the correct encoding.
String always uses UTF-16. Creating a String with the content "ใบใบใบใบ" from UTF-8 data and from TIS620 data will produce exactly identical String objects; there's no way to know what encoding was used to create them.
InputStreamReader, OutputStreamWriter and comparable classes can also be passed an encoding to decode/encode with that encoding respectively. Other than that, no special handling is required.
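The points above can be sketched as follows, again assuming the TIS-620 charset is available in the JVM. The class and helper names are hypothetical, and a ByteArrayOutputStream stands in for whatever stream leads to the other system.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameStringSketch {
    // Assumption: this charset is available in the running JVM.
    static final Charset TIS620 = Charset.forName("TIS-620");

    // Decode the same text from UTF-8 bytes and from TIS-620 bytes, then compare.
    static boolean sameString(String text) {
        String fromUtf8 = new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        String fromTis620 = new String(text.getBytes(TIS620), TIS620);
        return fromUtf8.equals(fromTis620);
    }

    // Let the writer do the encoding on the way out.
    static byte[] writeAs(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A\u0E43\u0E1A";
        System.out.println("identical Strings: " + sameString(text));
        System.out.println("bytes written as TIS-620: " + writeAs(text, TIS620).length);
        System.out.println("bytes written as UTF-8:   " + writeAs(text, StandardCharsets.UTF_8).length);
    }
}
```

The encoding is chosen only at the boundary where characters become bytes; the String itself carries no record of it.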

Source: https://stackoverflow.com/questions/58181867/java-how-to-verify-if-thai-characters-are-encoded-correctly-from-utf-8-to-tis62

Tags

java

encoding

utf-8

character-encoding

utf-16