Python3 emoji characters as unicode

Question

I have a string in Python 3 that contains emojis, and I want to treat the emojis as their Unicode representation. I need to do some manipulation on the emojis in this format.

s = '😬 😎 hello'

This treats each emoji as its own character, so that len(s) == 9 and s[0] == '😬'.

I want to change the format of the string so that each emoji is represented by its UTF-16 surrogate code points, such that

s = '😬 😎 hello'
u = to_unicode(s)   # Some function to change the format.
print(u) # '\ud83d\ude2c \ud83d\ude0e hello'
u[0] == '\ud83d' and u[1] == '\ude2c'
len(u) == 11

Any thoughts on creating a function to_unicode that will take s and change it into u? I could be thinking about how strings/Unicode work in Python 3 wrong, so any help/corrections would be greatly appreciated.

Answer 1:

Here's some code that encodes each character as UTF-16 and turns every 16-bit code unit back into a character of its own, so any character that maps into two UTF-16 words becomes a surrogate pair.

s = '\U0001f62c \U0001f60e hello'

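def pairup(b):
    # Combine each pair of big-endian bytes into one 16-bit code unit.
    return [(b[i] << 8 | b[i+1]) for i in range(0, len(b), 2)]

def utf16(c):
    # Encode one character as UTF-16 (big-endian, no BOM) and emit one character per code unit.
    e = c.encode('utf_16_be')
    return ''.join(chr(x) for x in pairup(e))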

u = ''.join(utf16(c) for c in s)
print(repr(u))
print(u[0] == '\ud83d' and u[1] == '\ude2c')
print(len(u))

Output

'\ud83d\ude2c \ud83d\ude0e hello'
True
11

I thought this was going to be a no-brainer, but it turned out to be trickier than I expected, especially since I didn't understand the problem properly the first time through.
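
For comparison, the same surrogate pairs can also be computed arithmetically, without going through an encoder. The following is only a sketch, not the code from this answer: the function name to_surrogates is made up for illustration, and it assumes the input contains no lone surrogates already. It relies on the standard UTF-16 rule: subtract 0x10000 from the code point, then put the top 10 bits on 0xD800 and the low 10 bits on 0xDC00.

```python
def to_surrogates(text):
    """Replace each non-BMP character with its UTF-16 surrogate pair.

    Hypothetical helper for illustration; BMP characters pass through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            v = cp - 0x10000                       # 20-bit offset into the astral planes
            out.append(chr(0xD800 + (v >> 10)))    # high (lead) surrogate: top 10 bits
            out.append(chr(0xDC00 + (v & 0x3FF)))  # low (trail) surrogate: bottom 10 bits
        else:
            out.append(ch)
    return ''.join(out)

u = to_surrogates('\U0001f62c \U0001f60e hello')
print(ascii(u))   # ascii() escapes the surrogates so they can be displayed
print(len(u))
```

For the sample string this should give the same 11-character result as the code above.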

Answer 2:

It is not clear why you need this, but here's how you could represent non-BMP Unicode characters as surrogate pairs:

#!/usr/bin/env python3
import re

def as_surrogates(astral):
    b = astral.group().encode('utf-16be')
    return ''.join([b[i:i+2].decode('utf-16be', 'surrogatepass')
                    for i in range(0, len(b), 2)])

s = '\U0001f62c \U0001f60e hello'
u = re.sub(r'[^\u0000-\uFFFF]+', as_surrogates, s)
print(ascii(u))
assert u.encode('utf-16', 'surrogatepass').decode('utf-16') == s

Output

'\ud83d\ude2c \ud83d\ude0e hello'
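
Once the manipulation is done, the surrogate form usually has to be turned back into ordinary characters, because a string holding lone surrogates cannot be encoded as UTF-8 (which is why the code above prints with ascii()). A minimal sketch of the reverse direction, using the same round trip as the assert in the code above; the name from_surrogates is hypothetical:

```python
def from_surrogates(text):
    """Recombine surrogate pairs into the characters they stand for.

    Hypothetical helper; assumes every surrogate in text belongs to a valid pair.
    """
    # 'surrogatepass' lets the encoder write surrogates out as plain 16-bit units;
    # the decoder then reads each pair back as a single code point.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')

u = '\ud83d\ude2c \ud83d\ude0e hello'
s = from_surrogates(u)
print(len(u), len(s))

# Without an error handler, encoding the surrogate form to UTF-8 fails.
try:
    u.encode('utf-8')
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
print(utf8_ok)
```

Here len(u) is 11 and len(s) is 9, matching the lengths in the question.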

Source: https://stackoverflow.com/questions/32826565/python3-emoji-characters-as-unicode

Tags

python-3.x

unicode

emoji