How can I restore proper encoding of 4 byte emoji characters that have been stored in plain utf8 - like this: ðŸ˜Š?

Question

Is it possible to re-encode garbled 3- or 4-byte emoji strings into emoji again?

I inherited a MySQL InnoDB table with the utf8_unicode_ci collation. These garbled 4-byte emoji strings are everywhere. Is it possible to translate them back into emoji?

My first step was to change the character set to utf8mb4. This changed all strings like ð��£ into strings like ðŸ˜Š.

But what I really want is to translate ðŸ˜Š back into an actual emoji character. (I have no idea whether ðŸ˜Š is really a smiley.)

Answer 1:

Inspired by Ignacio Vazquez-Abrams' comment. The following Python snippet shows how the mojibake originates (emoji to mojibake) and how to reverse it (repair):

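# How the damage arises: the UTF-8 bytes of each emoji are misread as cp1252.
print("\nEmoji to mojibake (origin):")
for emoji_char in ['😊', '😣', '👽', '😎']:
    print(emoji_char, emoji_char.encode('utf8').decode('cp1252'))

# The repair reverses it: recover the raw bytes via cp1252, then decode them as UTF-8.
print("\nmojibake to Emoji (repair):")
for mojibake_string in ['ðŸ˜Š', 'ðŸ˜£', 'ðŸ‘½', 'ðŸ˜Ž', 'ðŸ™‡']:
    print(mojibake_string, mojibake_string.encode('cp1252').decode('utf8'))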

I know the question is tagged php rather than python; I hope an analogous PHP solution would be very similar.

Output:

==> chcp 65001
Active code page: 65001

==> D:\test\Python\20108312.py

Emoji to mojibake (origin):
😊 ðŸ˜Š
😣 ðŸ˜£
👽 ðŸ‘½
😎 ðŸ˜Ž

mojibake to Emoji (repair):
ðŸ˜Š 😊
ðŸ˜£ 😣
ðŸ‘½ 👽
ðŸ˜Ž 😎
ðŸ™‡ 🙇

==>

Python version:

Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32
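
The repair loop above can be wrapped in a reusable function. The following is a minimal sketch, not a definitive implementation: the names `repair_mojibake` and `_misread_bytes` are hypothetical, and it assumes the damage was exactly one round of UTF-8 bytes being read as cp1252. Python's cp1252 codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the sketch additionally assumes that such bytes ended up stored as the control characters with the same code point values; that assumption should be checked against the real data.

```python
def _misread_bytes(text: str, wrong_encoding: str = "cp1252") -> bytes:
    """Recover the raw bytes that were wrongly decoded as `wrong_encoding`."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(wrong_encoding)
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise  # not something a single misread byte could have produced
            out.append(ord(ch))  # assumption: stored as the raw byte value
    return bytes(out)


def repair_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo one round of UTF-8-read-as-cp1252 damage; leave other text untouched."""
    try:
        return _misread_bytes(text, wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake of this kind, so return it unchanged


print(repair_mojibake("\u00f0\u0178\u02dc\u0160"))  # the string ðŸ˜Š from the question
print(repair_mojibake("plain ASCII text"))
```

Text that cannot be interpreted this way, such as ordinary accented words or strings that already contain real emoji, is returned unchanged, so a string mixing repaired and unrepaired emoji would need to be handled piece by piece.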

Answer 2:

Most emoji live in the Supplementary Multilingual Plane, so their code points do not fit in 16 bits (Unicode code points can need up to 21 bits). Per the chart on that page, emoji code points are prefixed with 1F, spanning 1F3xx to 1F7xx. If your application had simply stripped the top 5 bits to fit them into 16 bits, you would have been left with F3xx to F7xx, which falls in the Private Use Area of the Basic Multilingual Plane. The data you are showing is not in that range, so you may have to analyze it more deeply to see whether it is recoverable by recombining the bits and restoring the 1F prefix.
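
One way to test the truncation idea against the sample data is to print, for a few emoji, the full code point, its low 16 bits, and its UTF-8 bytes, and compare them with the code points of the garbled string from the question. This is only an illustrative sketch; the helper name `describe` is made up here.

```python
def describe(ch: str) -> dict:
    """Return the code point, its low 16 bits, and the UTF-8 bytes of one character."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:05X}",
        # What a naive truncation to 16 bits would keep.
        "low_16_bits": f"U+{cp & 0xFFFF:04X}",
        "utf8_bytes": " ".join(f"{b:02X}" for b in ch.encode("utf-8")),
    }


# The four emoji used in Answer 1, written as escapes.
for ch in "\U0001F60A\U0001F623\U0001F47D\U0001F60E":
    print(ch, describe(ch))

# Code points of the garbled text ðŸ˜Š from the question, for comparison.
garbled = "\u00f0\u0178\u02dc\u0160"
print([f"U+{ord(c):04X}" for c in garbled])
```

For U+1F60A the low 16 bits are U+F60A and the UTF-8 bytes are F0 9F 98 8A, while the garbled string consists of U+00F0, U+0178, U+02DC and U+0160, which are the cp1252 readings of exactly those four bytes. That pattern suggests a byte-level misdecoding rather than truncated code points.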

Source: https://stackoverflow.com/questions/20108312/how-can-i-restore-proper-encoding-of-4-byte-emoji-characters-that-have-been-stor

Tags

php

unicode

encoding

character-encoding

emoji