Check if byte sequence contains UTF-16


故里飘歌 2020-12-21 07:12

I am reading a byte sequence from a stream. Assume for the sake of argument, that the sequence is of a fixed length and I read the whole thing into a byte array (in my case

2 answers

情深已故 (original poster)

2020-12-21 07:38
If the content is expected to be in a language written in the Latin script, simply counting null bytes will detect UTF-16, because most Latin-script characters are encoded in UTF-16 as one null byte plus one non-null byte. In UTF-8, a null byte decodes to the NUL control character (U+0000), which doesn't normally appear in text.
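To make the null-byte observation concrete, here is a small Python sketch. The sample string and the `null_count` helper are illustrative assumptions, not part of the original answer.

```python
# Hypothetical sample: plain Latin-script text.
text = "Hello"

utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")  # no BOM is written for the -le/-be codecs
utf16be = text.encode("utf-16-be")


def null_count(data: bytes) -> int:
    """Return the number of 0x00 bytes in data."""
    return data.count(0)


print(utf16le)              # b'H\x00e\x00l\x00l\x00o\x00'
print(null_count(utf8))     # 0
print(null_count(utf16le))  # 5
print(null_count(utf16be))  # 5
```

Every ASCII character contributes exactly one null byte in UTF-16 and none in UTF-8, which is why the count separates the two so cleanly for this kind of text.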

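Text written in other scripts is usually not valid as both UTF-16 and UTF-8 at once unless it's artificially constructed to be so. A caveat: UTF-16 code units whose two bytes both fall in the range 0x01 to 0x7F (for example, the basic Cyrillic letters U+0410 to U+044F) also form valid UTF-8 with no null bytes, so validity alone cannot rule out UTF-16 for such text.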

So, first check whether the data is a fully valid UTF-8 sequence on its own:
- If yes, check for null bytes. If there are any, it's UTF-16; otherwise it's UTF-8.
- If not, it's UTF-16.

If the result is UTF-16, that's not enough, as you also have to know the endianness. For languages written in the Latin script, the position of the null bytes tells you this: nulls mostly at odd offsets (counting from 0) mean little-endian, and nulls mostly at even offsets mean big-endian.
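The steps above could be sketched in Python roughly as follows. The function names `is_valid_utf8` and `guess_encoding` are hypothetical, and breaking ties toward little-endian is an assumption of this sketch rather than something the answer specifies. It is a heuristic for Latin-script text, not a general-purpose detector.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def guess_encoding(data: bytes) -> str:
    """Guess 'utf-8', 'utf-16-le' or 'utf-16-be' using the null-byte heuristic."""
    # Valid UTF-8 with no null bytes is taken to be UTF-8.
    if is_valid_utf8(data) and 0 not in data:
        return "utf-8"

    # Otherwise treat it as UTF-16 and look at where the nulls sit.
    even_nulls = data[0::2].count(0)  # offsets 0, 2, 4, ...
    odd_nulls = data[1::2].count(0)   # offsets 1, 3, 5, ...

    # Latin-script code units have a zero high byte, which is the second
    # byte of each pair in little-endian and the first in big-endian.
    # Assumption: ties fall back to little-endian.
    return "utf-16-le" if odd_nulls >= even_nulls else "utf-16-be"


print(guess_encoding("Hello".encode("utf-8")))      # utf-8
print(guess_encoding("Hello".encode("utf-16-le")))  # utf-16-le
print(guess_encoding("Hello".encode("utf-16-be")))  # utf-16-be
```

If the stream starts with a byte order mark, checking for it first would be more reliable than counting nulls; the sketch leaves that out to stay close to the procedure described here.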