Get escaped Unicode code from string

Submitted by 孤人 on 2019-12-24 03:08:26

Question


I seem to have the opposite problem from everyone else in the development world: I need to generate escaped characters from strings. For instance, given the word MESSAGE:, I need to generate:

\\u004D\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003A

The closest thing I could get using Python was:

u'MESSAGE:'.encode('utf16')
# output = '\xff\xfeM\x00E\x00S\x00S\x00A\x00G\x00E\x00:\x00'

My first thought was that I could replace \x with \u00 (or something to that effect), but I quickly realized that wouldn't work. What can I do to output the escaped (unescaped?) string in Python (preferably)?

Before everyone starts "answering" and downvoting: the escaped \u00... string is what my app receives from a third-party app that I have no control over. I'm trying to generate my own test data so I don't have to rely on that third-party app.


Answer 1:


I think this (quick & dirty) Python 2 code does what you want:

''.join('\\u' + x.encode('utf_16_be').encode('hex') for x in u'MESSAGE:')
# output: '\\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a'

Or, if you want doubled backslashes in the output:

''.join('\\\\u' + x.encode('utf_16_be').encode('hex') for x in u'MESSAGE:')
# output: '\\\\u004d\\\\u0045\\\\u0053\\\\u0053\\\\u0041\\\\u0047\\\\u0045\\\\u003a'
print _
# output: \\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a

If you absolutely need upper-case for hexadecimal codes:

''.join('\\u' + x.encode('utf_16_be').encode('hex').upper() for x in u'MESSAGE:')
# output: '\\u004D\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003A'
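str.encode('hex') no longer exists in Python 3, so here is a minimal Python 3 sketch of the same per-character idea, assuming Python 3.5+ (for bytes.hex()). The function name escape_chars and its prefix/upper parameters are illustrative only and do not come from the answer. Like the original, it is only reliable for characters up to U+FFFF.

```python
def escape_chars(text, prefix='\\u', upper=False):
    """Escape each character of text as prefix + 4 hex digits (BMP only)."""
    parts = []
    for ch in text:
        # UTF-16BE gives exactly two bytes for any BMP character.
        code = ch.encode('utf_16_be').hex()
        parts.append(prefix + (code.upper() if upper else code))
    return ''.join(parts)


print(escape_chars('MESSAGE:'))
# \u004d\u0045\u0053\u0053\u0041\u0047\u0045\u003a
print(escape_chars('MESSAGE:', prefix='\\\\u', upper=True))
# \\u004D\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003A
```

Passing prefix='\\\\u' reproduces the doubled-backslash variant, and upper=True the upper-case one.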



Answer 2:


Pierre's answer (the first one above) is nearly right, but the for x in u'MESSAGE:' loop fails for characters above U+FFFF everywhere except on ‘narrow builds’ (primarily Python 1.6–3.2 on Windows), which use UTF-16 for Unicode strings.

On ‘wide builds’ (and in 3.3+ where the distinction no longer exists), len(unichr(0x10000)) is 1 not 2. When this code point is UTF-16BE-encoded you get two surrogates taking up four bytes, so the output is '\\uD800DC00' instead of what you probably wanted, u'\\uD800\\uDC00'.

To cover it on both variants of Python you can do:

>>> h = u'MESSAGE:\U00010000'.encode('utf-16be').encode('hex')
# '004d004500530053004100470045003ad800dc00'
>>> ''.join(r'\u' + h[i:i+4] for i in range(0, len(h), 4))
'\\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a\\ud800\\udc00'
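The same encode-once-then-slice approach carries over to Python 3 with bytes.hex() in place of .encode('hex'). The following is a sketch under that assumption (Python 3.5+); escape_utf16 is a made-up name, not an existing API.

```python
def escape_utf16(text, upper=False):
    """Return text as \\uXXXX escapes, one per UTF-16 code unit."""
    # Encode the whole string once; every 4 hex digits is one code unit,
    # so characters above U+FFFF naturally come out as surrogate pairs.
    h = text.encode('utf-16-be').hex()
    if upper:
        h = h.upper()
    return ''.join('\\u' + h[i:i + 4] for i in range(0, len(h), 4))


print(escape_utf16('MESSAGE:\U00010000'))
# \u004d\u0045\u0053\u0053\u0041\u0047\u0045\u003a\ud800\udc00
```

Slicing the hex string rather than iterating over characters is what keeps each surrogate in its own \u escape.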



Answer 3:


There's no need to go through the .encode() step if you don't have characters outside the BMP (>0xFFFF):

>>> ''.join('\\u{:04x}'.format(ord(a)) for a in u'Message')
'\\u004d\\u0065\\u0073\\u0073\\u0061\\u0067\\u0065'
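If characters above U+FFFF might turn up after all, the ord() approach can be extended by splitting such code points into a surrogate pair arithmetically. The sketch below assumes Python 3; escape_ord is a hypothetical helper, and the json.loads round trip is just one convenient way to sanity-check generated test data, since JSON string literals use the same \uXXXX syntax.

```python
import json


def escape_ord(text):
    """Escape text as \\uXXXX per UTF-16 code unit, using only ord()."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split an astral code point into high and low surrogates.
            cp -= 0x10000
            out.append('\\u{:04x}\\u{:04x}'.format(0xD800 + (cp >> 10),
                                                   0xDC00 + (cp & 0x3FF)))
        else:
            out.append('\\u{:04x}'.format(cp))
    return ''.join(out)


sample = 'Message\U0001F600'
escaped = escape_ord(sample)
print(escaped)
# \u004d\u0065\u0073\u0073\u0061\u0067\u0065\ud83d\ude00

# Round trip: wrapping the escapes in quotes makes a valid JSON string.
assert json.loads('"' + escaped + '"') == sample
```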


Source: https://stackoverflow.com/questions/27432656/get-escaped-unicode-code-from-string
