I am trying to solve this interview question.
After being given a clear definition of the UTF-8 format, e.g. 1 byte: 0b0xxxxxxx, 2 bytes: .... Asked to wri
Well, I am grateful for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that a Java String is already decoded text (it holds UTF-16 code units, not raw bytes), so encoding it as UTF-8 never throws. One way to check this, given a string, is:
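import java.io.UnsupportedEncodingException;

public static boolean isUTF8(String s) {
    try {
        // Encoding a String to UTF-8; the result is unused, only the exception matters.
        byte[] bytes = s.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Only thrown if the charset name is unknown, and every JVM must support UTF-8.
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}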
However, since every String is already Unicode text, I never had a chance to trigger an error.
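Second, if given a byte array, each element will always be in the range -2^7 (0b10000000) to 2^7 - 1 (0b01111111), so every element fits in a single byte. That alone does not make the array valid UTF-8, though: the byte 0b11111111, for example, never occurs in UTF-8.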
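Since Java bytes are signed, comparing them against the bit patterns from the UTF-8 definition usually means masking first. Here is a minimal sketch of that step; the class and method names (`ByteBits`, `unsigned`, `isContinuation`) are made up for illustration:

```java
final class ByteBits {
    // Reinterpret a signed Java byte (-128..127) as an unsigned value (0..255).
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    // True if the byte has the form 0b10xxxxxx, i.e. a UTF-8 continuation byte.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }
}
```

For example, `ByteBits.unsigned((byte) -1)` gives 255, which is the bit pattern 0b11111111.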
My initial understanding of the question was that, given a string such as "0b11111111", I should check whether it is valid UTF-8. I guess I was wrong.
Moreover, Java does provide a constructor that converts a byte array to a String, and if you are interested in the decode method, check here.
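As far as I can tell, the `new String(bytes, charset)` constructor does not report bad input: it silently replaces malformed sequences with U+FFFD. If the goal is to detect invalid bytes, one option is a `CharsetDecoder` configured to report errors. This is only a sketch; the class name `StrictUtf8` and the method name are my own:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Hypothetical helper: true only if the whole array decodes as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

This leans on the JDK's decoder rather than the bit patterns, so it probably would not satisfy an interviewer who wants the checks written by hand.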
One more thing: the answer above would be correct in another language. The only improvement could be:
In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.
So 4 bytes would be enough.
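To make the 4-byte limit concrete, here is a rough hand-written check over a byte array. It is my own sketch, not the code from the answer above, and the names (`Utf8Validator`, `isValidUtf8`) are made up. Besides the U+10FFFF cap, it also rejects overlong encodings and surrogate code points, which RFC 3629 forbids as well:

```java
final class Utf8Validator {
    // Hypothetical helper: validates a byte array against the 1- to 4-byte UTF-8 forms.
    static boolean isValidUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b0 = data[i] & 0xFF;   // Java bytes are signed, so mask first
            int len;                   // total length of this sequence
            int cp;                    // code point being assembled
            if (b0 < 0x80) {                    // 0xxxxxxx
                len = 1; cp = b0;
            } else if ((b0 & 0xE0) == 0xC0) {   // 110xxxxx
                len = 2; cp = b0 & 0x1F;
            } else if ((b0 & 0xF0) == 0xE0) {   // 1110xxxx
                len = 3; cp = b0 & 0x0F;
            } else if ((b0 & 0xF8) == 0xF0) {   // 11110xxx
                len = 4; cp = b0 & 0x07;
            } else {
                return false;  // stray continuation byte, or a 5-/6-byte lead
            }
            if (i + len > data.length) {
                return false;  // sequence is cut off at the end of the array
            }
            for (int k = 1; k < len; k++) {
                int b = data[i + k] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    return false;  // continuation bytes must look like 10xxxxxx
                }
                cp = (cp << 6) | (b & 0x3F);
            }
            // Reject overlong forms: the value must need this many bytes.
            if (len == 2 && cp < 0x80) return false;
            if (len == 3 && cp < 0x800) return false;
            if (len == 4 && cp < 0x10000) return false;
            // Reject values past U+10FFFF and the UTF-16 surrogate range.
            if (cp > 0x10FFFF) return false;
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;
            i += len;
        }
        return true;
    }
}
```

With this, a 4-byte sequence such as F4 90 80 80 (which would decode to U+110000) is rejected, and so is any lead byte of the form 111110xx or higher.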
I am definitely new to this, so please correct me if I am wrong. Thanks a lot.