I am reading a byte sequence from a stream. Assume, for the sake of argument, that the sequence is of a fixed length and that I read the whole thing into a byte array. How can I tell whether the bytes are UTF-8 or UTF-16?
If the contents are expected to be written in a language that uses the Latin script, simply counting null bytes will detect UTF-16: Latin-script characters encoded as UTF-16 carry a zero byte in every two-byte code unit, whereas in UTF-8 a zero byte decodes to the NUL control character, which doesn't normally appear in text.
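Something like this, for example (a minimal Python sketch; the one-third threshold is my own arbitrary cutoff, tune it for your data):

```python
def looks_like_utf16(data: bytes) -> bool:
    # Latin-script UTF-16 has a null in almost every 2-byte unit;
    # genuine UTF-8 text has essentially none.
    return bool(data) and data.count(0) / len(data) > 1 / 3
```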
Text written in other scripts cannot be fully valid as both UTF-16 and UTF-8 unless it has been artificially constructed to be so.
So, first detect whether it's a fully valid UTF-8 sequence on its own:
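In Python, for example, a strict decode raises UnicodeDecodeError on any invalid sequence:

```python
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False
```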
If the above test points to UTF-16, that's not enough, as you have to know the endianness as well. With languages written in the Latin script, counting the null bytes at even versus odd offsets will tell you: UTF-16BE stores the zero (high) byte of an ASCII-range character first, while UTF-16LE stores it second.
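A rough sketch of that offset test (again assuming mostly ASCII-range text):

```python
def guess_utf16_byte_order(data: bytes) -> str:
    even = sum(data[i] == 0 for i in range(0, len(data), 2))
    odd = sum(data[i] == 0 for i in range(1, len(data), 2))
    # ASCII-range characters put their zero byte first in big-endian
    # UTF-16 and second in little-endian UTF-16.
    return 'utf-16-be' if even > odd else 'utf-16-le'
```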
So, does that mean there's no way to generically figure out which one it is?
That's right. The byte string [0x30, 0x30] can be the UTF-8 string 00 or the UTF-16 encoding of the character 〰 (U+3030). That's a WAVY DASH, in case you were wondering.
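You can see the ambiguity directly:

```python
data = bytes([0x30, 0x30])
print(data.decode('utf-8'))      # 00
print(data.decode('utf-16-be'))  # 〰  (WAVY DASH, U+3030)
```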
There are a few more heuristics to try, the most reliable being a byte order mark (BOM) at the start of the data: 0xEF 0xBB 0xBF marks UTF-8, 0xFF 0xFE marks UTF-16LE, and 0xFE 0xFF marks UTF-16BE.
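A sketch of BOM sniffing (the name sniff_bom is mine; note that a BOM is optional, so a missing one proves nothing):

```python
def sniff_bom(data: bytes) -> str | None:
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    return None  # no BOM; fall back to the other heuristics
```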
If those fail, you'll have to default to one of the options, or do some kind of plausibility check on the contents of the string when decoded as both UTF-8 and UTF-16.
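One way to do that last check (a sketch; the scoring rule is my own invention, so tune it to your data): decode with each candidate encoding and keep whichever result has the largest share of ordinary-looking characters.

```python
import unicodedata

def plausibility(text: str) -> float:
    # Fraction of characters that are letters, digits, punctuation,
    # symbols, separators, or common whitespace.
    if not text:
        return 0.0
    ok = sum(ch in '\t\r\n' or unicodedata.category(ch)[0] in 'LNPSZ'
             for ch in text)
    return ok / len(text)

def choose_encoding(data: bytes) -> str:
    scores = {}
    for enc in ('utf-8', 'utf-16-le', 'utf-16-be'):
        try:
            scores[enc] = plausibility(data.decode(enc))
        except UnicodeDecodeError:
            scores[enc] = 0.0
    return max(scores, key=scores.get)
```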