Implement a function to check if a string/byte array follows the UTF-8 format

遥遥无期 2020-12-16 00:20

I am trying to solve this interview question.

After being given a clear definition of the UTF-8 format (e.g. 1 byte: 0b0xxxxxxx, 2 bytes: ....), I was asked to wri

5 answers
  •  臣服心动
    2020-12-16 00:57

    Well, I am grateful for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that a Java String is already decoded text (UTF-16 code units internally), so encoding it as UTF-8 never fails. One way to check this, given a string, is:

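    import java.io.UnsupportedEncodingException;

    // Always returns true: the only checked failure is an unknown charset name.
    public static boolean isUTF8(String s) {
        try {
            byte[] bytes = s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // This exception is about the charset name, not the string's contents.
            e.printStackTrace();
            System.exit(-1);
        }
        return true;
    }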
    

    However, since every printable string is already in Unicode form, I never managed to trigger an error.

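    Second, if given a byte array, each element will always be in the range -2^7 (0b10000000) to 2^7 - 1 (0b01111111), so every byte is a legal 8-bit value. That alone does not make the sequence valid UTF-8, though: the lead and continuation bytes still have to follow the defined bit patterns.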

    My initial understanding of the question was that, given a string such as "0b11111111", I should check whether it is valid UTF-8. I guess I was wrong.

    Moreover, Java does provide a constructor that converts a byte array to a string, and if you are interested in the decode method, check here.
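
    For a byte array, one way to let the JDK itself do the validation is a strict decoder. The following is a minimal sketch, assuming the input arrives as a byte[]; the class name StrictUtf8 and the method isValidUtf8 are made up for illustration. Note that new String(bytes, StandardCharsets.UTF_8) would not work as a check, because it silently replaces malformed input with U+FFFD instead of failing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Hypothetical helper: true if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC3 announces a 2-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

    The decoded text is thrown away here; only whether decoding succeeded matters.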

    One more thing: the answer above would be correct in another language. The only improvement I would suggest is:

    In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

    So 4 bytes would be enough.
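
    To make the 4-byte limit concrete, here is a rough sketch of a hand-written check in Java, assuming the usual bit patterns (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx, with 10xxxxxx continuation bytes). Utf8Validator and isValidUtf8 are hypothetical names, and the overlong and surrogate checks follow my reading of RFC 3629 rather than anything stated in the question.

```java
final class Utf8Validator {
    // Hypothetical helper: checks the byte patterns by hand, up to 4-byte sequences.
    static boolean isValidUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b0 = data[i] & 0xFF; // Java bytes are signed; mask to 0..255
            int len;                 // total length of this sequence in bytes
            int cp;                  // code point bits collected so far
            if (b0 < 0x80) {
                len = 1; cp = b0;                 // 0xxxxxxx
            } else if ((b0 & 0xE0) == 0xC0) {
                len = 2; cp = b0 & 0x1F;          // 110xxxxx
            } else if ((b0 & 0xF0) == 0xE0) {
                len = 3; cp = b0 & 0x0F;          // 1110xxxx
            } else if ((b0 & 0xF8) == 0xF0) {
                len = 4; cp = b0 & 0x07;          // 11110xxx
            } else {
                return false; // stray continuation byte, or a 5/6-byte lead
            }
            if (i + len > data.length) {
                return false; // sequence is cut off at the end of the array
            }
            for (int k = 1; k < len; k++) {
                int b = data[i + k] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    return false; // continuation bytes must look like 10xxxxxx
                }
                cp = (cp << 6) | (b & 0x3F);
            }
            // Reject overlong forms: each length has a minimum code point.
            if (len == 2 && cp < 0x80) return false;
            if (len == 3 && cp < 0x800) return false;
            if (len == 4 && cp < 0x10000) return false;
            // Reject values past U+10FFFF and UTF-16 surrogates.
            if (cp > 0x10FFFF) return false;
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;
            i += len;
        }
        return true;
    }
}
```

    For example, Utf8Validator.isValidUtf8(new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}) accepts the three-byte encoding of U+20AC, while a lone (byte) 0x80 is rejected as a continuation byte without a lead.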

    I am definitely new to this, so correct me if I am wrong. Thanks a lot.
