How can I restore proper encoding of 4 byte emoji characters that have been stored in plain utf8 - like this: ðŸ˜Š?

Posted by 感情迁移 on 2019-12-10 03:12:35

Question


Is it possible to re-encode emoji 3 or 4 byte strings into emoji again?

I inherited a MySQL InnoDB table with the utf8_unicode_ci collation. These 4 byte emoji strings are everywhere. Is it possible to translate them back into emoji?

My first step was to change the character set to utf8mb4. This changed all strings like � to strings like this: ðŸ˜Š.

But what I really want is to translate ðŸ˜Š into something like 😊. (I have no idea if ðŸ˜Š is really a smiley.)

Answer 1:


Inspired by Ignacio Vazquez-Abrams' comment. The following Python snippet shows how the mojibake arises (an emoji is encoded as UTF-8 and the bytes are then decoded as cp1252) and how to reverse the process (repair):

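# Origin: emoji encoded as UTF-8, then wrongly decoded as cp1252
print("\nEmoji to mojibake (origin):")
for emoji_char in ['😊', '😣', '👽', '😎']:
    print(emoji_char, emoji_char.encode('utf8').decode('cp1252'))

# Repair: re-encode as cp1252 to recover the bytes, then decode as UTF-8
print("\nmojibake to Emoji (repair):")
for mojibake_string in ['ðŸ˜Š', 'ðŸ˜£', 'ðŸ‘½', 'ðŸ˜Ž', 'ðŸ™‡']:
    print(mojibake_string, mojibake_string.encode('cp1252').decode('utf8'))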

I know that the question is tagged php rather than python; I hope an analogous PHP solution would be very close…

Output:

==> chcp 65001
Active code page: 65001

==> D:\test\Python\20108312.py

Emoji to mojibake (origin):
😊 ðŸ˜Š
😣 ðŸ˜£
👽 ðŸ‘½
😎 ðŸ˜Ž

mojibake to Emoji (repair):
ðŸ˜Š 😊
ðŸ˜£ 😣
ðŸ‘½ 👽
ðŸ˜Ž 😎
ðŸ™‡ 🙇

==>

Python version:

Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32
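
For use on real data, the repair step from this answer could be wrapped in a function that leaves a string untouched when the round trip fails. The following is only a sketch: the name `repair_mojibake` and its parameters are hypothetical, and it assumes the damage really was UTF-8 bytes read as cp1252.

```python
def repair_mojibake(text, wrong_codec="cp1252", right_codec="utf-8"):
    """Undo a 'UTF-8 bytes read as cp1252' mix-up.

    Hypothetical helper: returns the text unchanged when the
    round trip is not possible.
    """
    try:
        return text.encode(wrong_codec).decode(right_codec)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252, or the recovered
        # bytes are not valid UTF-8. In both cases it was probably not
        # this kind of mojibake, so leave it alone.
        return text


# Build the mojibake from the emoji itself so the example does not
# depend on how this page is encoded.
smiley = "\U0001F60A"
broken = smiley.encode("utf-8").decode("cp1252")

print(repair_mojibake(broken) == smiley)   # repaired
print(repair_mojibake("plain ASCII"))      # unchanged
print(repair_mojibake(smiley) == smiley)   # already correct, unchanged
```

Python's cp1252 codec leaves a few byte values (such as 0x8D) undefined, so emoji whose UTF-8 encoding contains one of those bytes would pass through this sketch unrepaired and may need separate handling.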



Answer 2:


Most emoji have code points above U+FFFF, in the Supplementary Multilingual Plane, so they do not fit in 16 bits (Unicode code points use up to 21 bits). Per the chart on that page, emoji code points are prefixed with 1F, spanning 1F3xx to 1F7xx. If your application simply stripped the top 5 bits to fit into 16 bits, you would have been left with F3xx to F7xx, which falls in the Private Use Area of the Basic Multilingual Plane (E000 to F8FF). Given that the data you are showing is not in that range, you may have to analyze the data more deeply to see whether it is recoverable by recombining the bits and restoring the stripped leading bit (adding 0x10000).
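
The truncation scenario described in this answer can be checked with a few lines of arithmetic. This is an illustrative sketch only: the function names are hypothetical, and it assumes the hypothesized damage was a plain mask to the low 16 bits of a plane-1 code point.

```python
def truncate_to_16_bits(char):
    """Return the code point of `char` with everything above bit 15 dropped."""
    return ord(char) & 0xFFFF


def restore_plane_1(code_unit):
    """Put back the plane-1 bit (0x10000) that the truncation removed."""
    return chr(code_unit | 0x10000)


smiley = "\U0001F60A"                      # U+1F60A
truncated = truncate_to_16_bits(smiley)

print(hex(ord(smiley)))                    # original code point
print(hex(truncated))                      # what a 16-bit field would hold
print(0xE000 <= truncated <= 0xF8FF)       # inside the BMP Private Use Area?
print(restore_plane_1(truncated) == smiley)
```

If stored values do not fall in the range this produces, simple 16-bit truncation is unlikely to be what happened to the data.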



Source: https://stackoverflow.com/questions/20108312/how-can-i-restore-proper-encoding-of-4-byte-emoji-characters-that-have-been-stor
