Check if byte sequence contains utf-16

故里飘歌 2020-12-21 07:12

I am reading a byte sequence from a stream. Assume, for the sake of argument, that the sequence has a fixed length and that I read the whole thing into a byte array (in my case

2 Answers
  •  一个人的身影
    2020-12-21 07:39

    So, does that mean there's no way to generically figure out which one it is?

    That's right. The byte string [0x30, 0x30] can be the UTF-8 string 00 or the UTF-16 encoding of the character 〰 (U+3030). That's a WAVY DASH, in case you were wondering.
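To make the ambiguity concrete, here is a minimal Python sketch (Python is an assumption; the question does not name a language) that decodes the same two bytes both ways. The variable names are illustrative only.

```python
# The same two bytes, decoded under three different assumptions.
data = bytes([0x30, 0x30])

as_utf8 = data.decode("utf-8")          # two ASCII digits
as_utf16_le = data.decode("utf-16-le")  # one code unit, 0x3030
as_utf16_be = data.decode("utf-16-be")  # same code unit, since both bytes are equal

print(repr(as_utf8), repr(as_utf16_le), repr(as_utf16_be))
```

All three decodes succeed without error, so validity alone cannot tell the encodings apart for this input.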

    There are a few more heuristics to try:

    • You can check whether the string begins with a UTF-16 BOM (Windows programs love those): neither BOM (FF FE or FE FF) is a valid start of a UTF-8 sequence, because the bytes 0xFE and 0xFF never occur in UTF-8.
    • If you're sure there are no NUL characters in the string, then every even-length string that contains a 0x00 byte must be UTF-16.
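The two heuristics above can be sketched as follows. This is only an illustration in Python: the function name `is_probably_utf16` is hypothetical, and the second check is sound only under the stated assumption that the text itself never contains NUL characters.

```python
import codecs


def is_probably_utf16(data: bytes) -> bool:
    """Return True if one of the two heuristics says the bytes are UTF-16.

    A False result does not prove the data is UTF-8; it only means
    neither heuristic fired.
    """
    # Heuristic 1: a UTF-16 BOM (FF FE or FE FF) cannot start a UTF-8 sequence.
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return True
    # Heuristic 2 (assumes the text has no NUL characters): a 0x00 byte in an
    # even-length sequence must be half of a UTF-16 code unit.
    if len(data) % 2 == 0 and b"\x00" in data:
        return True
    return False
```

For example, `is_probably_utf16("hi".encode("utf-16-be"))` is true because of the zero high bytes, while `is_probably_utf16(b"00")` is false, which leaves the [0x30, 0x30] case undecided.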

    If those fail, you'll have to default to one of the two options, or do some kind of check on the contents of the string when decoded as both UTF-8 and UTF-16.
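One possible shape for such a content check, as a rough Python sketch: decode the bytes under each candidate encoding, discard the ones that fail, and score the rest. The scoring rule here (share of printable ASCII characters) is an assumption chosen for illustration and only suits mostly-ASCII text; the names `decode_candidates`, `ascii_ratio`, and `best_guess` are hypothetical.

```python
def decode_candidates(data: bytes) -> dict:
    """Map each candidate encoding to its decoded text, skipping failures."""
    results = {}
    for enc in ("utf-8", "utf-16-le", "utf-16-be"):
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            pass  # not valid in this encoding, so rule it out
    return results


def ascii_ratio(text: str) -> float:
    """Fraction of characters that are printable ASCII or ASCII whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isascii() and (ch.isprintable() or ch.isspace()) for ch in text)
    return good / len(text)


def best_guess(data: bytes, default: str = "utf-8") -> str:
    """Pick the encoding whose decoded text looks most like plain ASCII text."""
    candidates = decode_candidates(data)
    if not candidates:
        return default
    # On a tie, max() keeps the first candidate, so UTF-8 wins by default.
    return max(candidates, key=lambda enc: ascii_ratio(candidates[enc]))
```

With this scoring, `best_guess("hello".encode("utf-16-le"))` picks `utf-16-le`, and `best_guess(b"00")` falls back to `utf-8`. Text that is mostly non-ASCII would need a different scoring rule.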
