What's the point of UTF-16?

前端 未结 5 976
太阳男子
太阳男子 2021-01-31 01:29

I\'ve never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-3

5条回答
  •  天命终不由人
    2021-01-31 01:40

    None of the replies indicating an advantage of UTF-16 over UTF-8 make any sense, except for the backwards-compatibility reply.

    Well, there are two caveats to my comment.

    Erik states: "UTF-16 covers the entire BMP with single units - So unless you have a need for the rarer characters outside the BMP, UTF-16 is effectively 2 bytes per character."

    Caveat 1)

    If you can be certain that your application will NEVER need any character outside of the BMP, and that any library code you write for use with it will NEVER be used with any application that will ever need a character outside the BMP, then you could use UTF-16, and write code that makes the implicit assumption that every character will be exactly two bytes in length.

    That seems exceedingly dangerous (actually, stupid).

    If your code assumes that all UTF-16 characters are two bytes in length, and your program interacts with an application or library where there is a single character outside of the BMP, then your code will break. Code that examines or manipulates UTF-16 must be written to handle the case of a UTF-16 character requiring more than 2 bytes; therefore, I am "dismissing" this caveat.

    UTF-16 is not simpler to code for than UTF-8 (code for both must handle variable-length characters).

    Caveat 2)

    UTF-16 MIGHT be more computationally efficient, under some circumstances, if suitably written.

    Like this: Suppose that certain long strings are seldom modified, but often examined (or better, never modified once built - i.e., a string builder creating unmodifiable strings). A flag could be set for each string, indicating whether the string contains only "fixed length" characters (i.e., contains no characters that are not exactly two bytes in length). Strings for which the flag is true could be examined with optimized code that assumes fixed length (2 byte) characters.

    How about space-efficiency?

    UTF-16 is, obviously, more efficient for A) characters for which UTF-16 requires fewer bytes to encode than does UTF-8.

    UTF-8 is, obviously, more efficient for B) characters for which UTF-8 requires fewer bytes to encode than does UTF-16.

    Except for very "specialized" text, it's likely that count(B) far exceeds count(A).

提交回复
热议问题