Why does json.dumps escape non-ascii characters with “\uxxxx”

眼角桃花 2020-12-10 21:37

In Python 2, the function json.dumps() will ensure that all non-ASCII characters are escaped as \uxxxx.

Python 2 Json

But isn't

3 Answers
  • 2020-12-10 21:44

    That's exactly the point. You get a byte string back, not a Unicode string. Thus the Unicode characters need to be escaped to survive. The escaping is allowed by JSON and thus presents a safe way of representing Unicode characters.
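
    As a minimal sketch of the point above: the default output contains only ASCII characters, and json.loads() recovers the original text. The sample dict is made up, and the code is written so that it also runs on Python 3, where json.dumps() returns str rather than a byte string.

```python
import json

data = {"name": u"\u00f8"}   # hypothetical sample value: one non-ASCII character
text = json.dumps(data)      # ensure_ascii=True is the default

# Every character of the output is in the ASCII range.
is_ascii = all(ord(ch) < 128 for ch in text)

# The escape is undone on load, so nothing is lost.
round_trip = json.loads(text)
print(text, is_ascii, round_trip == data)
```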

  • 2020-12-10 21:56

    In a Python 2 byte string literal, the \u in "\u00f8" isn't actually an escape sequence like \x; it is the two literal characters r'\u'. But such byte strings can easily be converted to Unicode.

    Demo:

    s = "\u00f8"
    u = s.decode('unicode-escape')
    print repr(s), len(s), repr(u), len(u)
    
    s = "\u2122"
    u = s.decode('unicode-escape')
    print repr(s), len(s), repr(u), len(u)
    

    Output

    '\\u00f8' 6 u'\xf8' 1
    '\\u2122' 6 u'\u2122' 1
    

    As J.F.Sebastian mentions in the comments, inside a Unicode string \u00f8 is a true escape code, i.e., in a Python 3 string or in a Python 2 u"\u00f8" string. Also take heed of his other remarks!
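
    The demo above is Python 2 only. A rough Python 3 counterpart, assuming the six-character text is pure ASCII, has to go through bytes, because str has no decode() method there:

```python
s = r"\u00f8"                                   # six characters: backslash, u, 0, 0, f, 8
u = s.encode("ascii").decode("unicode_escape")  # interpret the escape sequence
print(repr(s), len(s), repr(u), len(u))
```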

  • 2020-12-10 22:00

    Why does json.dumps escape non-ascii characters with “\uxxxx”

    Python 2 may mix ascii-only bytestrings and Unicode strings together.

    Returning a bytestring might be a premature optimization: in Python 2, a Unicode string may require 2-4 times more memory than the corresponding bytestring when most of its characters are in the ASCII range.

    Also, even today, print(unicode_string) may easily fail if the string contains non-ASCII characters while printing to the Windows console, unless something like the win-unicode-console Python package is installed. It may fail even on Unix if the C/POSIX locale is used (the default for init.d services, ssh, and cron in many cases), because that locale implies the ASCII character encoding. There is C.UTF-8, but it is not always available and you have to configure it explicitly. This might explain why you might want ensure_ascii=True in some cases.
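
    A hedged illustration of that failure mode, without touching the real locale: printing to an ASCII-only stream amounts to encoding the text as ASCII, which is simulated here with an explicit encode() call.

```python
import json

text = u"\u00f8"

# The raw character cannot be encoded as ASCII, which is what an
# ASCII-only console or locale would require.
try:
    text.encode("ascii")
    raw_ok = True
except UnicodeEncodeError:
    raw_ok = False

# The default json.dumps() output is ASCII-only, so the same step succeeds.
escaped = json.dumps(text)
escaped_bytes = escaped.encode("ascii")
print(raw_ok, escaped_bytes)
```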

    The JSON format is defined for Unicode text, and therefore, strictly speaking, json.dumps() should always return a Unicode string, but it may return a bytestring if all characters are in the ASCII range (xml.etree.ElementTree has a similar "optimization"). It is confusing that Python 2 allows treating an ASCII-only bytestring as a Unicode string in some cases (implicit conversions are allowed). Python 3 is stricter (implicit conversions are forbidden).
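
    A small sketch of that difference, assuming it is run on Python 3 (on Python 2 the same concatenation succeeds and yields u'abcdef'):

```python
# Mixing a bytestring with a Unicode string.
try:
    mixed = b"abc" + u"def"
    implicit_ok = True
except TypeError:
    # Python 3 refuses the implicit bytes-to-text conversion.
    implicit_ok = False

# The explicit conversion works on both versions.
explicit = b"abc".decode("ascii") + u"def"
print(implicit_ok, explicit)
```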

    ASCII-only bytestrings might be used instead of Unicode strings (with possible non-ASCII characters) to save memory and/or improve interoperability in Python 2.

    To disable that behavior, use json.dumps(obj, ensure_ascii=False).
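
    For comparison, a minimal sketch of both settings on a made-up one-element list. The lengths in the usage note are for Python 3, where both results are str:

```python
import json

obj = [u"\u00f8"]                               # hypothetical sample input
escaped = json.dumps(obj)                       # default: ensure_ascii=True
literal = json.dumps(obj, ensure_ascii=False)   # keeps the character itself

# Both texts decode to the same object.
same_object = json.loads(escaped) == json.loads(literal) == obj
print(len(escaped), len(literal), same_object)
```

    The escaped form is 10 characters long and the unescaped one is 5, yet both load back to the same list.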


    It is important to avoid confusing a Unicode string with its representation in Python source code as a Python string literal, or with its representation in a file as JSON text.

    The JSON format allows escaping any character, not just Unicode characters outside the ASCII range:

    >>> import json
    >>> json.loads(r'"\u0061"')
    u'a'
    >>> json.loads('"a"')
    u'a'
    

    Don't confuse it with escapes in Python string literals used in Python source code. u"\u00f8" is a single Unicode character, but "\u00f8" in the output is eight characters. In Python source code, you could write it as r'"\u00f8"' == '"\\u00f8"' == u'"\\u00f8"' (backslash is special in both Python literals and JSON text, so double escaping may happen). Also, there are no \x escapes in JSON:

    >>> json.loads(r'"\x61"') # invalid JSON
    Traceback (most recent call last):
    ...
    ValueError: Invalid \escape: line 1 column 2 (char 1)
    >>> r'"\x61"' # valid Python literal (6 characters)
    '"\\x61"'
    >>> '"\x61"'  # valid Python literal with escape sequence (3 characters)
    '"a"'
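
    The length claims and the \x behavior above can be checked with a short sketch that also runs on Python 3, where the error raised is json.JSONDecodeError, a subclass of ValueError:

```python
import json

out = json.dumps(u"\u00f8")
print(len(u"\u00f8"), len(out))   # one character in memory, eight in the JSON text

# Two spellings of the same eight-character text in Python source.
same = (out == r'"\u00f8"' == '"\\u00f8"')

# \x is a Python literal escape, not a JSON escape.
try:
    json.loads(r'"\x61"')
    x_escape_ok = True
except ValueError:
    x_escape_ok = False
print(same, x_escape_ok)
```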
    

    The output of json.dumps() is a str, which is a byte string in Python 2. And thus shouldn't it escape characters as \xhh ?

    json.dumps(obj, ensure_ascii=True) produces only printable ASCII characters, and therefore print repr(json.dumps(u"\xf8")) won't contain the \xhh escapes that repr() uses to represent non-printable characters (bytes).

    \u escapes can be necessary even for ascii-only input:

    #!/usr/bin/env python2
    import json
    print json.dumps(map(unichr, range(128)))
    

    Output

    ["\u0000", "\u0001", "\u0002", "\u0003", "\u0004", "\u0005", "\u0006", "\u0007",
    "\b", "\t", "\n", "\u000b", "\f", "\r", "\u000e", "\u000f", "\u0010", "\u0011",
    "\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018", "\u0019",
    "\u001a", "\u001b", "\u001c", "\u001d", "\u001e", "\u001f", " ", "!", "\"", "#",
    "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", "0", "1", "2", "3",
    "4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C",
    "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S",
    "T", "U", "V", "W", "X", "Y", "Z", "[", "\\", "]", "^", "_", "`", "a", "b", "c",
    "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s",
    "t", "u", "v", "w", "x", "y", "z", "{", "|", "}", "~", "\u007f"]
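
    The same observation can be phrased as a check, sketched here for Python 3 (chr() instead of unichr()): collect the ASCII code points whose JSON form differs from the bare character in quotes.

```python
import json

# Code points below 128 that json.dumps() does not emit verbatim.
escaped = [i for i in range(128) if json.dumps(chr(i)) != '"' + chr(i) + '"']

# Control characters, the double quote, the backslash, and DEL.
expected = list(range(32)) + [34, 92, 127]
print(escaped == expected)
```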
    

    But isn't this quite confusing because \uxxxx is a unicode character and should be used inside a unicode string

    \uxxxx is six characters that may be interpreted as a single character in some contexts. For example, in Python source code, u"\uxxxx" is a Python literal that creates a Unicode string in memory with a single Unicode character. But if you see \uxxxx in a JSON text, it is six characters that may represent a single Unicode character if you load it (json.loads()).

    At this point, you should understand why len(json.loads('"\\\\"')) == 1.
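
    Spelled out step by step, as a sketch with made-up variable names: the Python literal is unescaped first, and the JSON parser then unescapes the result once more.

```python
import json

# Step 1: the Python literal has four backslashes in source,
# but the string it creates holds only four characters: " \ \ "
json_text = '"\\\\"'
print(len(json_text))

# Step 2: json.loads() reads that as the JSON string "\\",
# whose value is a single backslash.
decoded = json.loads(json_text)
print(len(decoded))
```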
