Check if byte sequence contains utf-16

Submitted by 孤街醉人 on 2019-11-28 09:20:47

Question


I am reading a byte sequence from a stream. Assume, for the sake of argument, that the sequence is of a fixed length and that I read the whole thing into a byte array (in my case it's a vector<char>, but that's not important for this question). This byte sequence contains a string, which may be in either UTF-16 or UTF-8 encoding. Unfortunately, there's no indicator of which one it is.

I can verify whether the byte sequence represents a valid UTF-16 encoding and also whether it represents a valid UTF-8 encoding, but I can also imagine how the same sequence of bytes may be valid UTF-8 and valid UTF-16 at the same time.

So, does that mean there's no way to generically figure out which one it is?


Answer 1:


If the contents are expected to be written in a language that uses the Latin script, simply counting null bytes will detect UTF-16, because UTF-16 encodes every ASCII-range character with a 0x00 high byte. In UTF-8, a null byte decodes to the NUL control character, which does not normally appear in text.

Text in languages written in other scripts usually cannot be fully valid in both UTF-16 and UTF-8 unless it is artificially constructed to be so.

So, first detect whether it is a fully valid UTF-8 sequence on its own:

  • If yes, check for null bytes, and if there are some, it's UTF-16. Otherwise it's UTF-8.
  • If not, it's UTF-16.
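The two steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the names `Guess`, `is_valid_utf8` and `guess_encoding` are made up for this example, and the validator is a hand-written strict check (it rejects overlong forms, surrogates and code points above U+10FFFF).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
enum class Guess { Utf8, Utf16 };

// Returns true if `bytes` is a well-formed UTF-8 sequence.
bool is_valid_utf8(const std::vector<char>& bytes) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte
        if (n - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}

// Heuristic from the steps above: invalid UTF-8, or valid UTF-8 that
// contains null bytes, is taken to be UTF-16.
Guess guess_encoding(const std::vector<char>& bytes) {
    if (!is_valid_utf8(bytes)) return Guess::Utf16;
    for (char c : bytes) {
        if (c == '\0') return Guess::Utf16;
    }
    return Guess::Utf8;
}
```

Note that this only ever answers "UTF-8" or "UTF-16"; it assumes the input is one of the two and does not check that the bytes are valid UTF-16.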

If the above resulted in UTF-16, that is not enough, because you also have to know the endianness. For languages written in the Latin script, whether the null bytes fall mostly at odd or at even offsets will tell you this.
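A minimal sketch of that endianness guess, assuming mostly Latin-script text. The names `Endian` and `guess_utf16_endianness` are invented for this example. In little-endian UTF-16 the letter "a" is stored as 61 00, so the nulls land on odd offsets; in big-endian it is 00 61, so they land on even offsets.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
enum class Endian { Little, Big, Unknown };

// Counts null bytes at even and odd offsets (0-based) and picks the
// byte order in which those nulls would be the high bytes.
Endian guess_utf16_endianness(const std::vector<char>& bytes) {
    std::size_t even_nulls = 0;
    std::size_t odd_nulls = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\0') {
            if (i % 2 == 0) ++even_nulls;
            else ++odd_nulls;
        }
    }
    if (odd_nulls > even_nulls) return Endian::Little;  // e.g. 61 00
    if (even_nulls > odd_nulls) return Endian::Big;     // e.g. 00 61
    return Endian::Unknown;  // no nulls, or a tie: the heuristic cannot decide
}
```

For text in other scripts the counts may be zero or tied, in which case this sketch returns `Endian::Unknown` rather than guessing.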




Answer 2:


"So, does that mean there's no way to generically figure out which one it is?"

That's right. The byte string [0x30, 0x30] can be either the UTF-8 string "00" or the UTF-16 encoding of the character 〰 (U+3030). That's a WAVY DASH, in case you were wondering.
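To make the ambiguity concrete, here is a small sketch that combines two bytes into one UTF-16 code unit. The helper name `utf16_code_unit` is made up for illustration. Because both bytes are 0x30, the result is 0x3030 in either byte order, while in UTF-8 each 0x30 byte on its own is the ASCII digit '0'.

```cpp
#include <cstdint>

// Combines two consecutive bytes into a single UTF-16 code unit,
// using the given byte order.
std::uint16_t utf16_code_unit(unsigned char first, unsigned char second,
                              bool little_endian) {
    return little_endian
        ? static_cast<std::uint16_t>(first | (second << 8))
        : static_cast<std::uint16_t>((first << 8) | second);
}
```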

There are a few more heuristics to try:

  • You can check whether the string begins with a BOM (Windows programs love those), since neither UTF-16 BOM (FF FE or FE FF) is a valid start of a UTF-8 sequence.
  • If you're sure there are no NUL characters in the string, then every even-length string containing bytes with the value zero (0x00) must be UTF-16.

If those fail, you'll have to default to one of the two options, or do some kind of check on the contents of the string when decoded as both UTF-8 and UTF-16.
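Both heuristics could look something like this. It is a sketch under stated assumptions: `Bom`, `detect_bom` and `must_be_utf16` are names invented here, and the check for the UTF-8 BOM (EF BB BF) is an addition that the answer itself does not mention.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Looks for a byte order mark at the start of the buffer.
Bom detect_bom(const std::vector<char>& bytes) {
    auto at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;     // assumption: also recognise the UTF-8 BOM
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;  // U+FEFF stored little-endian
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;  // U+FEFF stored big-endian
    return Bom::None;
}

// Second heuristic: only meaningful if the caller knows the text
// contains no NUL characters. An even-length buffer that contains a
// 0x00 byte is then taken to be UTF-16.
bool must_be_utf16(const std::vector<char>& bytes) {
    if (bytes.size() % 2 != 0) return false;
    for (char c : bytes) {
        if (c == '\0') return true;
    }
    return false;
}
```

A `false` from `must_be_utf16` does not mean the data is UTF-8, only that this particular heuristic could not decide.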



Source: https://stackoverflow.com/questions/14196386/check-if-byte-sequence-contains-utf-16
