Getting bytes from a Unicode string in Python

Question

I have a 16-bit big-endian Unicode string represented as u'\u4132'.

How can I split it into the integers 41 and 32 in Python?

Answer 1:

Here are several ways to do it, depending on the output you want.

Python 2:

>>> chars = u'\u4132'.encode('utf-16be')
>>> chars
'A2'
>>> ord(chars[0])
65
>>> '%x' % ord(chars[0])
'41'
>>> hex(ord(chars[0]))
'0x41'
>>> ['%x' % ord(c) for c in chars]
['41', '32']
>>> [hex(ord(c)) for c in chars]
['0x41', '0x32']

Python 3:

>>> chars = '\u4132'.encode('utf-16be')
>>> chars
b'A2'
>>> chars = bytes('\u4132', 'utf-16be')
>>> chars  # Just the same.
b'A2'
>>> chars[0]
65
>>> '%x' % chars[0]
'41'
>>> hex(chars[0])
'0x41'
>>> ['%x' % c for c in chars]
['41', '32']
>>> [hex(c) for c in chars]
['0x41', '0x32']

Answer 2:

Java: "\u4132".getBytes("UTF-16BE")
Python 2: u'\u4132'.encode('utf-16be')
Python 3: '\u4132'.encode('utf-16be')

These methods return a byte array, which you can convert to an int array easily. But note that code points above U+FFFF will be encoded using two code units (so with UTF-16BE this means 32 bits or 4 bytes).
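A minimal Python 3 sketch of that conversion. The helper name utf16be_ints is made up for illustration and is not part of any library:

```python
def utf16be_ints(text):
    """Return the UTF-16BE encoding of text as a list of integers (0-255)."""
    # Iterating over a bytes object in Python 3 yields ints directly.
    return list(text.encode('utf-16be'))

bmp = utf16be_ints('\u4132')         # one 16-bit code unit
astral = utf16be_ints('\U00010000')  # first code point above U+FFFF

print(bmp)     # [65, 50], i.e. 0x41 and 0x32
print(astral)  # [216, 0, 220, 0], i.e. the surrogate pair D800 DC00
```

In Python 2 the encoded result is a str, so each element would need ord() instead.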

Answer 3:

"Those" aren't integers; \u4132 is a hexadecimal number that represents the code point.

To get an integer representation of the code point, use ord(u'\u4132'). To convert that integer back to the Unicode character, use unichr() (chr() in Python 3), which returns a Unicode string.
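As a rough Python 3 illustration of working with the code point as an integer: the divmod() split below is an assumption about what the asker wants, and it only yields two bytes for code points up to U+FFFF.

```python
c = '\u4132'
code_point = ord(c)                    # 16690, which is 0x4132

# Split the 16-bit value into its high and low bytes.
high, low = divmod(code_point, 0x100)
print(hex(high), hex(low))             # 0x41 0x32

# Convert the integer back to a one-character string.
round_trip = chr(code_point)
print(round_trip == c)                 # True
```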

Answer 4:

>>> c = u'\u4132'
>>> '%x' % ord(c)
'4132'

Answer 5:

Dirty hack: in Python 2, repr(u'\u4132') will return "u'\\u4132'". In Python 3, ascii('\u4132') gives the similar "'\\u4132'".

Answer 6:

Pass the Unicode character to ord() to get its code point, break that code point into individual bytes with int.to_bytes() (Python 3), and then format the output however you want:

list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big')))

returns: ['0', '0', '41', '32']

list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big')))

returns: ['0', '1', 'f4', 'a9']

As I mentioned in another comment, encoding to UTF-16 will not work as expected for code points outside the BMP (Basic Multilingual Plane), since UTF-16 needs a surrogate pair to encode them.
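To make the difference concrete, here is a small Python 3 sketch comparing the two approaches for a character outside the BMP. The variable names are illustrative, and the minimal byte count is computed with bit_length() rather than the fixed width of 4 used above:

```python
poo = '\N{PILE OF POO}'  # U+1F4A9, outside the BMP

# Bytes of the UTF-16BE encoding: a surrogate pair (D83D DCA9).
utf16_bytes = [format(b, '02x') for b in poo.encode('utf-16be')]

# Bytes of the code point value itself, using only as many bytes as needed.
cp = ord(poo)
n_bytes = max(1, (cp.bit_length() + 7) // 8)
cp_bytes = [format(b, '02x') for b in cp.to_bytes(n_bytes, 'big')]

print(utf16_bytes)  # ['d8', '3d', 'dc', 'a9']
print(cp_bytes)     # ['01', 'f4', 'a9']
```

For characters inside the BMP, such as '\u4132', both lists come out the same: ['41', '32'].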

Source: https://stackoverflow.com/questions/4239666/getting-bytes-from-unicode-string-in-python

Tags

python

unicode

byte