Converting Unicode sequences to a string in Python 3

Submitted by ε祈祈猫儿з on 2019-12-08 08:01:41

Question


While parsing an HTML response to extract data with Python 3.4 (Kubuntu 15.10, Bash CLI), print() gives me output that looks like this:

\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

How would I output the actual text itself in my application?

This is the code generating the string:

import json
import requests

response = requests.get(url)
messages = json.loads(extract_json(response.text))

for k, v in messages.items():
    for message in v['foo']['bar']:
        print("\nFoobar: %s" % (message['body'],))

Here is the function which returns the JSON from the HTML page:

def extract_json(input_):

    """
    Get the JSON out of a webpage.
    The line of interest looks like this:
    foobar = ["{\"name\":\"dotan\",\"age\":38}"]
    """

    for line in input_.split('\n'):
        if 'foobar' in line:
            return line[line.find('"')+1:-2].replace(r'\"',r'"')

    return None

Googling the issue turned up quite a bit of information about Python 2, but Python 3 has completely changed how strings, and especially Unicode, are handled.

How can I convert the example string (\u05ea) to characters (ת) in Python 3?

Addendum:

Here is some information regarding message['body']:

print(type(message['body']))
# Prints: <class 'str'>

print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(repr(message['body']))
# Prints: '\\u05ea\\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'

print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין

Note that the last line does work as expected, but it has a few issues:

  • Decoding the text with unicode-escape is the wrong approach, because Python escapes differ from JSON escapes for many characters. (Thank you bobince)
  • encode() relies on the default encoding, which is a bad thing. (Thank you bobince)
  • The encode() fails on some newer Unicode characters, such as \ud83d\ude03, with UnicodeEncodeError "surrogates not allowed".
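The following minimal sketch reproduces two of these problems in isolation. The helper name decode_with_unicode_escape is hypothetical and only wraps the encode/decode chain shown above; the inputs are made-up samples, not data from the real page. It shows one way the "surrogates not allowed" error can arise, and that non-ASCII input is silently misread as Latin-1.

```python
def decode_with_unicode_escape(text):
    """Hypothetical helper: the encode/decode chain from the question."""
    return text.encode('utf-8').decode('unicode-escape')


# A JSON-style surrogate pair, written as literal backslash-u text.
halves = decode_with_unicode_escape(r'\ud83d\ude03')
print(len(halves))  # two separate surrogate code points, not one character

# Lone surrogates cannot be encoded as UTF-8.
error_reason = None
try:
    halves.encode('utf-8')
except UnicodeEncodeError as exc:
    error_reason = exc.reason
print(error_reason)

# unicode-escape reads non-ASCII bytes as Latin-1, so text that is
# already decoded gets garbled.
mojibake = decode_with_unicode_escape('é')
print(ascii(mojibake))
```

Both failures come from treating JSON escapes as Python string-literal escapes, which is why parsing the text as JSON is the safer route.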

Answer 1:


It appears your input uses the backslash as an escape character, so you should unescape the text before passing it to json:

>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש
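Applied to the question's extract_json, the same unescape step could look like the sketch below. It keeps the question's assumptions (the line of interest contains 'foobar' and ends with "]), and the sample page text is invented for illustration. Note that the regular expression removes one level of backslash escaping from every escaped character, not only from quotes.

```python
import json
import re


def extract_json(page_text):
    """Return the JSON text embedded in the first line containing 'foobar'.

    Assumes the line ends with "] as in the question's example.
    """
    for line in page_text.split('\n'):
        if 'foobar' in line:
            escaped = line[line.find('"') + 1:-2]
            # Undo one level of escaping: \" becomes " and \\ becomes \
            return re.sub(r'\\(.)', r'\1', escaped)
    return None


# Hypothetical page content with doubly escaped \u sequences.
sample_page = 'var x = 1\n' + r'foobar = ["{\"body\":\"\\u05ea\\u05d4\"}"]'

messages = json.loads(extract_json(sample_page))
print(ascii(messages['body']))
```

With the backslashes unescaped first, json.loads sees ordinary \uXXXX escapes and decodes them itself, so no 'unicode-escape' step is needed.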

Don't use 'unicode-escape' encoding on JSON text; it may produce different results:

>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'

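Here '😂' == '\U0001F602', that is, U+1F602 (FACE WITH TEARS OF JOY). json combines the surrogate pair \ud83d\ude02 into that single character, while 'unicode-escape' leaves two lone surrogates.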
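For reference, the arithmetic that maps a UTF-16 surrogate pair to one code point can be sketched as follows. The function name combine_surrogates is hypothetical; it is shown only to illustrate what a JSON decoder does with such a pair, not how the json module is implemented internally.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE02)
print(hex(code_point))  # 0x1f602
```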



Source: https://stackoverflow.com/questions/33468209/converting-unicode-sequences-to-a-string-in-python-3
