Question
Is it possible to re-encode emoji 3 or 4 byte strings into emoji again?
I inherited a MySQL InnoDB table with the utf8_unicode_ci collation (utf8 character set). These 4-byte emoji strings are everywhere. Is it possible to translate them back into emoji?
The first step was to change the character set to utf8mb4. This changed all strings like � to strings like this: ðŸ˜Š.
But what I really want is to translate ðŸ˜Š into something like 😊 (assuming ðŸ˜Š is really a smiley).
Answer 1:
Inspired by Ignacio Vazquez-Abrams' comment. The following Python snippet shows how the mojibake originates (emoji to mojibake) and how to reverse it (repair):
print("\nEmoji to mojibake (origin):")
for emojiChar in ['😊', '😣', '👽', '😎']:
    # UTF-8 bytes misread as cp1252 produce the mojibake
    print(emojiChar, emojiChar.encode('utf8').decode('cp1252'))
print("\nmojibake to Emoji (repair):")
for mojibakeString in ['ðŸ˜Š', 'ðŸ˜£', 'ðŸ‘½', 'ðŸ˜Ž', 'ðŸ™‡']:
    # Re-encoding as cp1252 recovers the original UTF-8 bytes
    print(mojibakeString, mojibakeString.encode('cp1252').decode('utf8'))
I know that the question is tagged PHP rather than Python; hopefully an analogous PHP solution would be very close.
Output:
==> chcp 65001
Active code page: 65001
==> D:\test\Python\20108312.py
Emoji to mojibake (origin):
😊 ðŸ˜Š
😣 ðŸ˜£
👽 ðŸ‘½
😎 ðŸ˜Ž
mojibake to Emoji (repair):
ðŸ˜Š 😊
ðŸ˜£ 😣
ðŸ‘½ 👽
ðŸ˜Ž 😎
ðŸ™‡ 🙇
==>
Python version:
Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32
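For a whole column of values, some of which may already be fine, the same round trip could be wrapped in a small helper that leaves a string alone when the repair does not apply. This is only a sketch: the name repair_mojibake is hypothetical, and it assumes the damage really was UTF-8 bytes decoded as cp1252, which may not match how a given MySQL connection mangled the data.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were mis-decoded as cp1252.

    Returns the text unchanged when the round trip is not possible.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible this way: already-correct emoji cannot be encoded
        # as cp1252, and ordinary accented text is not valid UTF-8.
        return text


# Build a mojibake sample the same way the damage is assumed to have happened.
broken = "😊".encode("utf-8").decode("cp1252")

for value in [broken, "😊", "café", "plain ASCII"]:
    print(repr(value), "->", repr(repair_mojibake(value)))
```

Note that Python's cp1252 codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so mojibake for an emoji whose UTF-8 form contains one of those bytes would not round-trip with this sketch and would be returned unchanged.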
Answer 2:
Most emoji live in the Supplementary Multilingual Plane, so their code points do not fit in 16 bits (Unicode code points as a whole fit in 21 bits). Per the chart on that page, emoji code points start with 1F, spanning roughly 1F3xx to 1F7xx. If your application had simply stripped the top 5 bits to fit each code point into 16 bits, you would have been left with F3xx to F7xx, which falls in the Private Use Area of the Basic Multilingual Plane per this information on the Basic Multilingual Plane. Given that the data you are showing does not look like that, you may have to analyze the data more deeply to see whether it is recoverable by recombining the bits and restoring the leading 1.
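The bit arithmetic can be checked with a few lines of Python. This is a sketch of the hypothetical truncation scenario only (nothing in the question confirms that the data was truncated this way), and the variable names are illustrative.

```python
emoji = "😊"
code_point = ord(emoji)

# The smiley is U+1F60A, which needs 17 bits.
print(hex(code_point), code_point.bit_length())

# Keeping only the low 16 bits drops the leading 1 of the plane number.
truncated = code_point & 0xFFFF
print(hex(truncated))

# The truncated value lands in the BMP Private Use Area (U+E000 to U+F8FF).
print(0xE000 <= truncated <= 0xF8FF)

# Restoring it means putting the plane-1 bit back, assuming the original
# character really was in the Supplementary Multilingual Plane.
restored = chr(0x10000 | truncated)
print(restored == emoji)
```

If stored values do not fall in that private-use range, simple truncation is unlikely to be what happened to them.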
Source: https://stackoverflow.com/questions/20108312/how-can-i-restore-proper-encoding-of-4-byte-emoji-characters-that-have-been-stor