Python3 emoji characters as unicode

℡╲_俬逩灬. Submitted on 2019-12-01 06:19:45

Here's some code that takes any character that maps to two UTF-16 words and converts it to the corresponding pair of surrogate code points. Characters that fit in a single UTF-16 word pass through unchanged.

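s = '\U0001f62c \U0001f60e hello'

def pairup(b):
    # Combine each pair of big-endian bytes into one 16-bit code unit.
    return [(b[i] << 8 | b[i+1]) for i in range(0, len(b), 2)]

def utf16(c):
    # Encode one character as UTF-16 and return each code unit as its own character.
    e = c.encode('utf_16_be')
    return ''.join(chr(x) for x in pairup(e))

u = ''.join(utf16(c) for c in s)
print(repr(u))
print(u[0] == '\ud83d' and u[1] == '\ude2c')
print(len(u))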

Output

'\ud83d\ude2c \ud83d\ude0e hello'
True
11
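
The surrogate values printed here follow from the standard UTF-16 arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The sketch below computes a pair directly, without going through an encoder. The function name surrogate_pair is a hypothetical helper for illustration and is not part of the answer's code.

```python
def surrogate_pair(ch):
    """Return the (high, low) UTF-16 surrogate code units for one non-BMP character."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError('character is in the BMP and needs no surrogate pair')
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits select the low surrogate
    return high, low

high, low = surrogate_pair('\U0001f62c')
print(hex(high), hex(low))  # 0xd83d 0xde2c
```

This gives the same two values that the encoder-based version produces for the first emoji.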

I thought this was going to be a no-brainer, but it turned out to be trickier than I expected, especially since I didn't understand the problem properly the first time through.

It is not clear why you need it, but here's how you could represent non-BMP Unicode characters as surrogate pairs:

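#!/usr/bin/env python3
import re

def as_surrogates(astral):
    # Encode the matched run of non-BMP characters as UTF-16 and decode each
    # 2-byte code unit separately, so the surrogates survive as code points.
    b = astral.group().encode('utf-16be')
    return ''.join([b[i:i+2].decode('utf-16be', 'surrogatepass')
                    for i in range(0, len(b), 2)])

s = '\U0001f62c \U0001f60e hello'
# Match runs of characters outside the BMP (above U+FFFF).
u = re.sub(r'[^\u0000-\uFFFF]+', as_surrogates, s)
print(ascii(u))
# Round trip: encoding the surrogates back to UTF-16 recovers the original string.
assert u.encode('utf-16', 'surrogatepass').decode('utf-16') == s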

Output

'\ud83d\ude2c \ud83d\ude0e hello'
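
Going the other way, from a string of surrogate code points back to the real characters, can be sketched with the same 'surrogatepass' error handler. The helper name from_surrogates is an assumption made for this example. It expects well-formed pairs, and a lone surrogate would make the decode step raise UnicodeDecodeError.

```python
def from_surrogates(text):
    """Merge UTF-16 surrogate pairs in text back into single non-BMP characters."""
    # 'surrogatepass' lets the encoder emit the surrogate code points as raw
    # UTF-16 code units; the strict decode then reads each pair as one character.
    return text.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')

u = '\ud83d\ude2c \ud83d\ude0e hello'
s = from_surrogates(u)
print(len(u), len(s))                        # 11 9
print(s == '\U0001f62c \U0001f60e hello')    # True
```

The length drops from 11 to 9 because each of the two surrogate pairs collapses into one code point.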