Why does json.dumps escape non-ascii characters with “\uxxxx”


In Python 2, the function json.dumps() will ensure that all non-ascii characters are escaped as \uxxxx.

Python 2 Json

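But isn't this quite confusing because \uxxxx is a unicode character and should be used inside a unicode string?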


3 Answers

一个人的身影

2020-12-10 21:44

That's exactly the point. You get a byte string back, not a Unicode string. Thus the Unicode characters need to be escaped to survive. The escaping is allowed by JSON and thus presents a safe way of representing Unicode characters.
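A minimal sketch of this point, assuming Python 3 (where json.dumps() returns a str rather than a byte string; the variable names are illustrative only). With the default settings the JSON text contains only ASCII, so it survives an ASCII encode and still loads back to the original value:

```python
import json

original = "Bj\u00f8rn \u2122"  # contains two non-ASCII characters

# With the default ensure_ascii=True, every non-ASCII character is escaped.
text = json.dumps(original)

# The JSON text is pure ASCII, so encoding it as ASCII cannot fail.
wire_bytes = text.encode("ascii")

# Loading the decoded bytes restores the original Unicode string.
restored = json.loads(wire_bytes.decode("ascii"))

print(text)
print(restored == original)
```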

余生分开走

2020-12-10 21:56
In a Python 2 byte string, the \u in "\u00f8" isn't an escape sequence the way \x is; it is the two literal characters r'\u'. Such byte strings can still be converted to Unicode easily.

Demo:
```
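# Python 2: in a byte string, \u is not an escape, so s holds 6 literal characters
s = "\u00f8"
# Interpret the literal \uXXXX sequence to get a 1-character unicode string
u = s.decode('unicode-escape')
print repr(s), len(s), repr(u), len(u)

s = "\u2122"
u = s.decode('unicode-escape')
print repr(s), len(s), repr(u), len(u)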
```
Output
```
'\\u00f8' 6 u'\xf8' 1
'\\u2122' 6 u'\u2122' 1
```
As J.F.Sebastian mentions in the comments, inside a Unicode string \u00f8 is a true escape code, i.e., in a Python 3 string or in a Python 2 u"\u00f8" string. Also take heed of his other remarks!
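For readers on Python 3, a rough equivalent of the demo might look like the sketch below. It assumes a bytes object that holds the literal backslash sequence (the names raw and decoded are illustrative), and it is not part of the original answer:

```python
# Python 3 sketch: a bytes object holding a literal backslash-u sequence.
raw = rb"\u00f8"  # six bytes: backslash, u, 0, 0, f, 8

# The unicode_escape codec interprets the sequence as a single character.
decoded = raw.decode("unicode_escape")

print(len(raw), len(decoded), decoded == "\u00f8")
```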
我寻月下人不归

2020-12-10 22:00
> Why does json.dumps escape non-ascii characters with “\uxxxx”

Python 2 may mix ascii-only bytestrings and Unicode strings together.

It might be a premature optimization: in Python 2, Unicode strings may require 2-4 times more memory than the corresponding bytestrings when most of their characters are in the ASCII range.

Also, even today, print(unicode_string) may easily fail when printing to a Windows console if the string contains non-ascii characters, unless something like the win-unicode-console Python package is installed. It may fail even on Unix if the C/POSIX locale is used (the default for init.d services, ssh, and cron in many cases), because that locale implies the ascii character encoding. There is C.UTF-8, but it is not always available and you have to configure it explicitly. This might explain why you might want ensure_ascii=True in some cases.
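The failure mode can be sketched without changing the locale, assuming Python 3 and simulating an ASCII-only output stream with an explicit encode (the sample value is illustrative). The raw string cannot be encoded as ASCII, while the default json.dumps() output can:

```python
import json

value = "caf\u00e9"  # ends with a non-ASCII character

# Simulate an ASCII-only stream (e.g. the C/POSIX locale) by encoding explicitly.
try:
    value.encode("ascii")
    raw_ok = True
except UnicodeEncodeError:
    raw_ok = False

escaped = json.dumps(value)              # ensure_ascii=True is the default
escaped_bytes = escaped.encode("ascii")  # succeeds: only ASCII characters remain

print(raw_ok, escaped_bytes)
```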

The JSON format is defined for Unicode text, so strictly speaking json.dumps() should always return a Unicode string, but it may return a bytestring if all characters are in the ASCII range (xml.etree.ElementTree has a similar "optimization"). It is confusing that Python 2 allows treating an ascii-only bytestring as a Unicode string in some cases (implicit conversions are allowed). Python 3 is stricter (implicit conversions are forbidden).

ASCII-only bytestrings might be used instead of Unicode strings (with possible non-ASCII characters) to save memory and/or improve interoperability in Python 2.

To disable that behavior, use json.dumps(obj, ensure_ascii=False).
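A short comparison of the two settings, written for Python 3 as an illustration (the sample dict is made up). Both forms are valid JSON and load to the same value; only the text differs:

```python
import json

data = {"name": "\u00f8"}

escaped = json.dumps(data)                      # default: ensure_ascii=True
literal = json.dumps(data, ensure_ascii=False)  # keep non-ASCII characters as-is

print(escaped)
print(len(escaped), len(literal))
print(json.loads(escaped) == json.loads(literal))
```

Whether the unescaped form is safe to write out depends on the encoding of the destination, which is the concern raised above about ASCII-only consoles and locales.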

It is important to avoid confusing a Unicode string with its representation in Python source code as a Python string literal or with its representation in a file as JSON text.

The JSON format allows escaping any character, not just Unicode characters outside the ASCII range:
```
>>> import json
>>> json.loads(r'"\u0061"')
u'a'
>>> json.loads('"a"')
u'a'
```
Don't confuse it with escapes in Python string literals used in Python source code. u"\u00f8" is a single Unicode character, but "\u00f8" in the output is eight characters. In Python source code, you could write it as r'"\u00f8"' == '"\\u00f8"' == u'"\\u00f8"' (the backslash is special in both Python literals and JSON text, so double escaping may happen). Also, there are no \x escapes in JSON:
```
>>> json.loads(r'"\x61"') # invalid JSON
Traceback (most recent call last):
...
ValueError: Invalid \escape: line 1 column 2 (char 1)
>>> r'"\x61"' # valid Python literal (6 characters)
'"\\x61"'
>>> '"\x61"'  # valid Python literal with escape sequence (3 characters)
'"a"'
```
> The output of json.dumps() is a str, which is a byte string in Python 2. And thus shouldn't it escape characters as \xhh?

json.dumps(obj, ensure_ascii=True) produces only printable ascii characters, and therefore print repr(json.dumps(u"\xf8")) won't contain the \xhh escapes that repr() uses to represent non-printable chars (bytes).

\u escapes can be necessary even for ascii-only input:
```
#!/usr/bin/env python2
import json
print json.dumps(map(unichr, range(128)))
```
Output
```
["\u0000", "\u0001", "\u0002", "\u0003", "\u0004", "\u0005", "\u0006", "\u0007",
"\b", "\t", "\n", "\u000b", "\f", "\r", "\u000e", "\u000f", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018", "\u0019",
"\u001a", "\u001b", "\u001c", "\u001d", "\u001e", "\u001f", " ", "!", "\"", "#",
"$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", "0", "1", "2", "3",
"4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C",
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S",
"T", "U", "V", "W", "X", "Y", "Z", "[", "\\", "]", "^", "_", "`", "a", "b", "c",
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s",
"t", "u", "v", "w", "x", "y", "z", "{", "|", "}", "~", "\u007f"]
```
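The same check can be sketched in Python 3, where chr() replaces unichr(). The helper list needs_escape is only illustrative; it collects the ASCII characters whose JSON form differs from the bare quoted character:

```python
import json

chars = [chr(i) for i in range(128)]

# Characters that json.dumps() cannot emit as-is between double quotes.
needs_escape = [c for c in chars if json.dumps(c) != '"' + c + '"']

# Control characters use \uXXXX unless JSON defines a short escape for them.
print(json.dumps("\x00"), json.dumps("\n"), json.dumps("\x7f"))
print(len(needs_escape))
```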
> But isn't this quite confusing because \uxxxx is a unicode character and should be used inside a unicode string?

\uxxxx is six characters that may be interpreted as a single character in some contexts, e.g., in Python source code u"\uxxxx" is a Python literal that creates a Unicode string in memory with a single Unicode character. But if you see \uxxxx in JSON text, it is six characters that may represent a single Unicode character once you load it (json.loads()).

At this point, you should understand why len(json.loads('"\\\\"')) == 1.
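One way to verify that claim step by step, as a sketch for Python 3 (variable names are illustrative): count the characters at each layer, first in the JSON text produced by the Python literal, then in the value produced by json.loads().

```python
import json

# Python source literal: each pair of backslashes denotes one real backslash,
# so the JSON text is 4 characters: quote, backslash, backslash, quote.
json_text = '"\\\\"'
value = json.loads(json_text)  # the JSON escape decodes to one backslash

# The same reasoning applies to a \uXXXX escape inside JSON text.
escaped_text = r'"\u00f8"'       # 8 characters of JSON text
char = json.loads(escaped_text)  # 1 Unicode character after loading

print(len(json_text), len(value), len(escaped_text), len(char))
```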