How to decode the unicode string starting with “%u” (percent symbol + u) in Python 3

Question

I get some HTML code like the following:

<new>8003,%u767E%u5723%u5E97,113734,%u4E50%u4E8B%u542E%u6307%u7EA2%u70E7%u8089%u5473,6924743915824,%u7F50,104g,3,21.57,-2.16,0,%u4E50%u4E8B,1</new>

I know I can find and replace every "%u" with "\u" in Notepad++ and then paste the result into the Python console, which displays it correctly as Chinese characters. But how can I do this automatically in Python?

Answer 1:

Assuming that your input string contains "percent-u" encoded characters, we can find and decode them with a regex replace and a callback function.

Percent-u encoding represents a Unicode code point as four hexadecimal digits: %u767E ⇒ 767E ⇒ codepoint 30334 ⇒ 百.
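As a quick check of that arithmetic, here is a minimal sketch that decodes a single escape by hand. The variable names are only illustrative.

```python
token = "%u767E"

# Drop the leading "%u" and keep the four hexadecimal digits.
hex_digits = token[2:]

# Interpret the digits as a base-16 number to get the code point.
code_point = int(hex_digits, 16)

# Turn the code point into the actual character.
character = chr(code_point)

print(hex_digits, code_point, character)  # 767E 30334 百
```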

import re

def hex_to_char(hex_str):
    """ converts a single hex-encoded character 'FFFF' into the corresponding real character """
    return chr(int(hex_str, 16))

s = "<new>8003,%u767E%u5723%u5E97,113734,%u4E50%u4E8B%u542E%u6307%u7EA2%u70E7%u8089%u5473,6924743915824,%u7F50,104g,3,21.57,-2.16,0,%u4E50%u4E8B,1</new>"

percent_u = re.compile(r"%u([0-9a-fA-F]{4})")

decoded = percent_u.sub(lambda m: hex_to_char(m.group(1)), s)

print(decoded)

which prints

<new>8003,百圣店,113734,乐事吮指红烧肉味,6924743915824,罐,104g,3,21.57,-2.16,0,乐事,1</new>
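If the data was produced by something like JavaScript's escape() function (an assumption, since the question does not say where the string comes from), two further cases can show up: two-digit %XX escapes for characters below U+0100, and characters outside the Basic Multilingual Plane written as two consecutive %uXXXX surrogate halves. The following is a rough sketch of one way to handle both. The name unescape_percent_u is made up for this example, and it treats every %XX as a single code unit rather than as a UTF-8 byte.

```python
import re

# Group 1: a %uXXXX escape. Group 2: a two-digit %XX escape.
ESCAPE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")

def unescape_percent_u(text):
    """Decode %uXXXX and %XX escapes, treating each as one UTF-16 code unit."""
    def to_char(match):
        # Exactly one of the two groups is set for any match.
        return chr(int(match.group(1) or match.group(2), 16))

    replaced = ESCAPE.sub(to_char, text)

    # A UTF-16 round trip merges surrogate halves (e.g. %uD83D%uDE00)
    # into one character; "surrogatepass" lets unpaired halves through.
    raw = replaced.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")

print(unescape_percent_u("8003,%u767E%u5723%u5E97%2C104g"))
```

Text that contains a percent sign not followed by a valid escape, such as "100%", is left unchanged.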

Source: https://stackoverflow.com/questions/61478782/how-to-decode-the-unicode-string-starting-with-u-percent-symbol-u-in-pyth

Tags

python

python-3.x

unicode

encoding
