Is the [0xff, 0xfe] prefix required on utf-16 encoded strings?

Submitted by 谁都会走 on 2019-12-20 06:23:23

Question


Rewritten question!

I am working with a vendor's device that requires "unicode encoding" of strings, where each character is represented in two bytes. My strings will always be ASCII based, so I thought this would be the way to translate my string into the vendor's string:

>>> b1 = 'abc'.encode('utf-16')

But examining the result, I see that there's a leading [0xff, 0xfe] on the bytearray:

>>> [hex(b) for b in b1]
['0xff', '0xfe', '0x61', '0x0', '0x62', '0x0', '0x63', '0x0']

Since the vendor's device is not expecting the [0xff, 0xfe], I can strip it off...

>>> b2 = 'abc'.encode('utf-16')[2:]
>>> [hex(b) for b in b2]
['0x61', '0x0', '0x62', '0x0', '0x63', '0x0']

... which is what I want.

But what really surprises me is that I can decode b1 and b2 and they both reconstitute to the original string:

>>> b1.decode('utf-16') == b2.decode('utf-16')
True

So my two intertwined questions:

  • What is the significance of the 0xff, 0xfe on the head of the encoded bytes?
  • Is there any hazard in stripping off the 0xff, 0xfe prefix, as with b2 above?
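For reference, the stripping step can be avoided entirely. A minimal sketch (the helper name `to_vendor_bytes` is hypothetical, and it assumes the device expects little-endian byte order, which matches the `['0x61', '0x0', ...]` layout shown above):

```python
def to_vendor_bytes(s: str) -> bytes:
    # 'utf-16-le' emits little-endian code units and writes no BOM,
    # so there is nothing to strip afterwards.
    return s.encode('utf-16-le')

print(to_vendor_bytes('abc'))  # b'a\x00b\x00c\x00'
```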

Answer 1:


This observation

... what really surprises me is that I can decode b1 and b2 and they both reconstitute to the original string:

b1.decode('utf-16') == b2.decode('utf-16')
True

suggests there is a built-in default, because there are two possible byte arrangements for the 16-bit UTF-16 code units: big-endian and little-endian.

Normally, Python deduces the endianness to use from the BOM when reading – and so it always adds one when writing. If you want to force a specific endianness, you can use the explicit encodings utf-16-le and utf-16-be:

… when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.
(https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data)

But if you do not use a specific ordering, then what default gets used? The original Unicode proposal, PEP 100, warns

Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.
(https://www.python.org/dev/peps/pep-0100/, my emph.)

Yet it works for you. If we look up in the Python source code how this is managed, we find this comment in _codecsmodule.c:

/* This version provides access to the byteorder parameter of the
   builtin UTF-16 codecs as optional third argument. It defaults to 0
   which means: use the native byte order and prepend the data with a
   BOM mark.
*/

and deeper, in unicodeobject.c,

/* Check for BOM marks (U+FEFF) in the input and adjust current
   byte order setting accordingly. In native mode, the leading BOM
   mark is skipped, in all other modes, it is copied to the output
   stream as-is (giving a ZWNBSP character). */

So initially, the byte order is set to the default for your system, and as soon as decoding encounters a BOM in the UTF-16 data, the byte order is switched to whatever the BOM specifies. The "native order" in this last comment covers the case where no byte order has been explicitly declared and none has been established by a BOM; then your system's endianness is used.
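A small sketch of that behaviour, using the BOM constants from the standard `codecs` module: with a BOM present the decoder follows the BOM regardless of platform, and without one it falls back to the machine's native order:

```python
import codecs
import sys

le_payload = 'abc'.encode('utf-16-le')   # b'a\x00b\x00c\x00', no BOM
be_payload = 'abc'.encode('utf-16-be')   # b'\x00a\x00b\x00c', no BOM

# A leading BOM overrides any default: both decode correctly.
assert (codecs.BOM_LE + le_payload).decode('utf-16') == 'abc'
assert (codecs.BOM_BE + be_payload).decode('utf-16') == 'abc'

# With no BOM at all, 'utf-16' assumes the machine's native byte order.
native = le_payload if sys.byteorder == 'little' else be_payload
assert native.decode('utf-16') == 'abc'
```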




Answer 2:


This is the byte order mark. It's a prefix to a UTF document that indicates what endianness the document uses. It does this by encoding the code point 0xFEFF in the byte order - in this case, little endian (less significant byte first). Anything trying to read it the other way around, in big endian (more significant byte first), will read the first character as 0xFFFE, which is a code point that is specifically not a valid character, informing the reader it needs to error or switch endianness for the rest of the document.
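This can be seen directly in Python (note that Python's decoder does not reject the noncharacter; it simply yields U+FFFE):

```python
bom = b'\xff\xfe'  # the prefix from the question

# Read with the intended (little-endian) order: the real BOM, U+FEFF.
assert bom.decode('utf-16-le') == '\ufeff'

# Read with the wrong (big-endian) order: U+FFFE, a noncharacter.
assert bom.decode('utf-16-be') == '\ufffe'
```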




Answer 3:


It is the byte order mark, a.k.a. BOM: see https://en.wikipedia.org/wiki/UTF-16 (look at the subheading "Byte order encoding schemes"). Its purpose is to allow the decoder to detect whether the encoding is little-endian or big-endian.




Answer 4:


It is the Unicode byte order mark encoded in UTF-16. Its purpose is to communicate the byte order to a reader expecting text encoded with a Unicode character encoding.

You can omit it if the reader otherwise knows or comes to know the byte order.

'abc'.encode('utf-16-le')



Answer 5:


The answers, and especially the comment from usr2564301, are helpful: the 0xff 0xfe prefix is the "Byte Order Mark", and it carries the endianness information along with the byte string. If you know which endianness you want, you can specify utf-16-le or utf-16-be as part of the encoding.

This makes it clear:

>>> 'abc'.encode('utf-16').hex()
'fffe610062006300'
>>> 'abc'.encode('utf-16-le').hex()
'610062006300'
>>> 'abc'.encode('utf-16-be').hex()
'006100620063'


Source: https://stackoverflow.com/questions/53305258/is-the-0xff-0xfe-prefix-required-on-utf-16-encoded-strings
