Get escaped Unicode code points from a string

Question

I seem to have the opposite problem from everyone else in the development world: I need to generate escaped characters from strings. For instance, given the string MESSAGE:, I need to generate:

\\u004D\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003A

The closest thing I could get using Python was:

u'MESSAGE:'.encode('utf16')
# output = '\xff\xfeM\x00E\x00S\x00S\x00A\x00G\x00E\x00:\x00'

My first thought was that I could replace \x with \u00 (or something to that effect), but I quickly realized that wouldn't work. How can I output the escaped (unescaped?) string, preferably in Python?

Before everyone starts "answering" and downvoting: the escaped \u00... string is what my app receives from a third-party app that I have no control over. I'm trying to generate my own test data so I don't have to rely on that third-party app.

Answer 1:

I think this quick-and-dirty Python 2 code does what you want:

''.join('\\u' + x.encode('utf_16_be').encode('hex') for x in u'MESSAGE:')
# output: '\\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a'

Or, if you want the backslash doubled in the printed output:

''.join('\\\\u' + x.encode('utf_16_be').encode('hex') for x in u'MESSAGE:')
# output: '\\\\u004d\\\\u0045\\\\u0053\\\\u0053\\\\u0041\\\\u0047\\\\u0045\\\\u003a'
print _
# output: \\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a

If you absolutely need upper-case for hexadecimal codes:

''.join('\\u' + x.encode('utf_16_be').encode('hex').upper() for x in u'MESSAGE:')
# output: '\\u004D\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003A'
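
The one-liners above rely on the Python 2 'hex' codec, which is not available through bytes.encode() in Python 3. A minimal Python 3 sketch of the same per-character idea, assuming Python 3.5+ for bytes.hex() and using a hypothetical helper name escape_per_char, might look like this:

```python
# Python 3 sketch (assumes 3.5+ for bytes.hex()); escape_per_char is a
# hypothetical helper name, not part of any library.
def escape_per_char(text, upper=False):
    parts = []
    for ch in text:
        # Encode one character as big-endian UTF-16 and render the bytes as hex.
        code = ch.encode('utf_16_be').hex()
        if upper:
            code = code.upper()
        parts.append('\\u' + code)
    return ''.join(parts)

print(escape_per_char('MESSAGE:'))
# \u004d\u0045\u0053\u0053\u0041\u0047\u0045\u003a
print(escape_per_char('MESSAGE:', upper=True))
# \u004D\u0045\u0053\u0053\u0041\u0047\u0045\u003A
```

Like the original, this is only reliable for characters in the BMP, because each character is encoded on its own.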

Answer 2:

Pierre's answer above is nearly right, but iterating with for x in u'MESSAGE:' fails for characters above U+FFFF, except on ‘narrow builds’ (primarily Python 1.6–3.2 on Windows), which use UTF-16 for Unicode strings.

On ‘wide builds’ (and in 3.3+, where the distinction no longer exists), len(unichr(0x10000)) is 1, not 2 (in Python 3 the function is chr rather than unichr). When this code point is encoded as UTF-16BE you get two surrogates taking up four bytes, so the output is '\\uD800DC00' instead of what you probably wanted, u'\\uD800\\uDC00'.
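
The failure mode can be reproduced in a few lines. This is a small Python 3 sketch (assuming 3.5+ for bytes.hex()); the variable names are only illustrative:

```python
# Python 3 sketch of the problem: one character, but two UTF-16 code units.
ch = '\U00010000'
print(len(ch))            # 1: a single code point in Python 3.3+

encoded = ch.encode('utf_16_be')
print(len(encoded))       # 4: a surrogate pair, two bytes per surrogate

# Prefixing the whole thing with a single \u yields a malformed escape.
broken = '\\u' + encoded.hex()
print(broken)             # \ud800dc00
```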

To cover it on both variants of Python you can do:

>>> h = u'MESSAGE:\U00010000'.encode('utf-16be').encode('hex')
# '004d004500530053004100470045003ad800dc00'
>>> ''.join(r'\u' + h[i:i+4] for i in range(0, len(h), 4))
'\\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a\\ud800\\udc00'
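
For Python 3, where .encode('hex') is gone, the same "encode once, then slice into 16-bit units" approach could be sketched as follows. The helper name escape_utf16 is hypothetical, and bytes.hex() assumes Python 3.5+:

```python
# Python 3 sketch; escape_utf16 is a hypothetical helper name.
def escape_utf16(text):
    # Encode the whole string once so that characters above U+FFFF
    # become surrogate pairs, then emit one \uXXXX per 2-byte code unit.
    data = text.encode('utf-16-be')
    return ''.join('\\u' + data[i:i + 2].hex() for i in range(0, len(data), 2))

print(escape_utf16('MESSAGE:\U00010000'))
# \u004d\u0045\u0053\u0053\u0041\u0047\u0045\u003a\ud800\udc00
```

Slicing the bytes two at a time is equivalent to slicing the hex string four characters at a time, as the original does.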

Answer 3:

There's no need to go through the .encode() step if you don't have characters outside the BMP (>0xFFFF):

>>> ''.join('\\u{:04x}'.format(ord(a)) for a in u'Message')
'\\u004d\\u0065\\u0073\\u0073\\u0061\\u0067\\u0065'
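
Since the goal is test data, it may help to check that the generated escapes decode back to the original text. The sketch below is Python 3 and rests on an assumption not stated in the question: that the third-party format uses the same \uXXXX syntax as JSON string literals, so json.loads can act as the decoder. The helper name escape_ord is hypothetical.

```python
import json

# Hypothetical helper: same idea as the one-liner above,
# valid only for code points up to 0xFFFF.
def escape_ord(text):
    return ''.join('\\u{:04x}'.format(ord(ch)) for ch in text)

escaped = escape_ord('Message')
print(escaped)  # \u004d\u0065\u0073\u0073\u0061\u0067\u0065

# Assumption: the escapes follow JSON string syntax, so wrapping them in
# double quotes gives a JSON string literal that json.loads can decode.
decoded = json.loads('"' + escaped + '"')
print(decoded)  # Message
```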

Source: https://stackoverflow.com/questions/27432656/get-escaped-unicode-code-from-string

Tags

python

unicode

escaping