Getting bytes from a Unicode string in Python

Posted by 本秂侑毒 on 2019-12-10 12:44:53

Question


I have a 16-bit big-endian Unicode string represented as u'\u4132'.

How can I split it into the integers 41 and 32 (hex) in Python?


Answer 1:


Here are several ways to do it, depending on the output format you want.

Python 2:

>>> chars = u'\u4132'.encode('utf-16be')
>>> chars
'A2'
>>> ord(chars[0])
65
>>> '%x' % ord(chars[0])
'41'
>>> hex(ord(chars[0]))
'0x41'
>>> ['%x' % ord(c) for c in chars]
['41', '32']
>>> [hex(ord(c)) for c in chars]
['0x41', '0x32']

Python 3:

>>> chars = '\u4132'.encode('utf-16be')
>>> chars
b'A2'
>>> chars = bytes('\u4132', 'utf-16be')
>>> chars  # Just the same.
b'A2'
>>> chars[0]
65
>>> '%x' % chars[0]
'41'
>>> hex(chars[0])
'0x41'
>>> ['%x' % c for c in chars]
['41', '32']
>>> [hex(c) for c in chars]
['0x41', '0x32']
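
One caveat with the '%x' format used above: it drops leading zeros, so a byte such as 0x05 prints as '5'. The following is a minimal Python 3 sketch that pads each byte to two hex digits; the helper name hex_bytes is made up for illustration and is not part of the original answer.

```python
def hex_bytes(text, encoding='utf-16be'):
    """Return each encoded byte of text as a two-digit hex string."""
    # Iterating over a bytes object in Python 3 yields ints.
    return ['%02x' % b for b in text.encode(encoding)]

print(hex_bytes('\u4132'))  # ['41', '32']
print(hex_bytes('\u0105'))  # ['01', '05']
```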



Answer 2:


  • Java: "\u4132".getBytes("UTF-16BE")
  • Python 2: u'\u4132'.encode('utf-16be')
  • Python 3: '\u4132'.encode('utf-16be')

Each of these returns a byte array (a str in Python 2, a bytes object in Python 3), which you can easily convert to an array of ints. Note, however, that code points above U+FFFF are encoded as two 16-bit code units, so in UTF-16BE such a character takes 32 bits, or 4 bytes.
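
To make the size difference concrete, here is a small Python 3 sketch. The function name utf16be_ints is hypothetical, chosen only for this example.

```python
def utf16be_ints(text):
    """Encode text as UTF-16BE and return the bytes as a list of ints."""
    return list(text.encode('utf-16be'))

bmp = utf16be_ints('\u4132')         # inside the BMP: one 16-bit code unit
astral = utf16be_ints('\U0001F4A9')  # above U+FFFF: a surrogate pair

print(bmp)     # [65, 50]
print(astral)  # [216, 61, 220, 169]
```

The second result is the surrogate pair D83D DCA9, not the code point 1F4A9 itself.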




Answer 3:


"Those" aren't two integers; 4132 is a single hexadecimal number that represents the code point.

If you want the code point as an integer, use ord(u'\u4132'). To convert that integer back to the character, use unichr() in Python 2 (chr() in Python 3), which returns a one-character unicode string.
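
As a rough Python 3 illustration of this approach, the code point can be split into a high and a low byte with divmod. This sketch assumes the character lies in the BMP, so that its code point fits in two bytes.

```python
code_point = ord('\u4132')             # 0x4132 == 16690

# Split the 16-bit value into its high and low bytes.
high, low = divmod(code_point, 0x100)

print(high, low)                       # 65 50
print('%x %x' % (high, low))           # 41 32

# chr() is the inverse of ord() in Python 3.
print(chr(code_point) == '\u4132')     # True
```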




Answer 4:


>>> c = u'\u4132'
>>> '%x' % ord(c)
'4132'



Answer 5:


Dirty hack (Python 2): repr(u'\u4132') returns "u'\\u4132'", and the hex digits can be sliced out of that string.




Answer 6:


In Python 3, pass the character to ord() to get its code point, break that code point into individual bytes with int.to_bytes(), and then format the output however you want:

list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big')))

returns: ['0', '0', '41', '32']

list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big')))

returns: ['0', '1', 'f4', 'a9']

As I mentioned in another comment, encoding the string as UTF-16 will not give the bytes of the code point for characters outside the BMP (Basic Multilingual Plane), because UTF-16 needs a surrogate pair to encode those code points.
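
The difference between the two approaches can be sketched in Python 3 as follows. The helper code_point_bytes is a made-up name for illustration; unlike the fixed 4-byte calls above, it uses only as many bytes as the code point needs. bytes.hex() requires Python 3.5 or later.

```python
def code_point_bytes(ch):
    """Return the code point of ch as big-endian bytes, without zero padding."""
    n = ord(ch)
    # Smallest number of whole bytes that can hold n (at least one).
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, 'big')

poo = '\N{PILE OF POO}'                  # U+1F4A9, outside the BMP

print(code_point_bytes(poo).hex())       # 01f4a9   (the code point itself)
print(poo.encode('utf-16be').hex())      # d83ddca9 (the surrogate pair)
print(code_point_bytes('\u4132').hex())  # 4132
```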



Source: https://stackoverflow.com/questions/4239666/getting-bytes-from-unicode-string-in-python
