Converting to UTF-8 (again)


天涯浪人 2021-01-20 07:25

I have this string: Traor\u0102\u0160

Traor\u0102\u0160 should produce TraorÃ©. Then TraorÃ©, UTF-8 decoded, shou

3 answers

轻奢々 (original poster)

2021-01-20 07:49
For me your site returns "Traor\u00e9" (the last character is é):
```
import json
import requests

r = requests.get(url)  # url is the endpoint from the question
print(json.dumps(json.loads(r.content)['Item']['LastName']))  # parse raw bytes, not r.text
# -> "Traor\u00e9" -> Traoré
```
r.json (r.text) produces incorrect content here. Either the server, requests, or both use an incorrect encoding, which results in "Traor\u0102\u0160". The encoding of a JSON text is completely determined by its content, so it is always possible to decode it whatever headers the server sends. From the JSON RFC:

JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
```
       00 00 00 xx  UTF-32BE
       00 xx 00 xx  UTF-16BE
       xx 00 00 00  UTF-32LE
       xx 00 xx 00  UTF-16LE
       xx xx xx xx  UTF-8
```
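The table above can be turned into a small detection routine. This is only a sketch: `detect_json_encoding` is a hypothetical helper name, it ignores byte order marks, and it simply falls back to UTF-8 for inputs shorter than four bytes.

```python
import json


def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding of JSON bytes from the null pattern in the first four octets."""
    head = data[:4]
    if len(head) < 4:
        # Too short to apply the table; assume the default encoding
        return "utf-8"
    # True where the octet is 00, False where it is a non-zero byte (xx)
    nulls = tuple(b == 0 for b in head)
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Anything else (xx xx xx xx) is treated as UTF-8
    return patterns.get(nulls, "utf-8")


def loads_any(data: bytes):
    """Decode JSON bytes with the detected encoding, then parse."""
    return json.loads(data.decode(detect_json_encoding(data)))
```

For example, `loads_any('{"LastName": "Traoré"}'.encode("utf-16-le"))` should give back the same dictionary as parsing the UTF-8 bytes. In Python 3.6 and later, `json.loads` accepts bytes directly and handles UTF-8, UTF-16 and UTF-32 input itself, so a helper like this is mainly useful for illustration.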
In this case there are no zero bytes at the start of r.content, so json.loads works. Otherwise, you need to convert it to a Unicode string manually, either because the server sends an incorrect character encoding in the Content-Type header or to work around the requests bug.
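A minimal sketch of the manual conversion, and of how the garbled name could arise. The payload is a made-up sample, and ISO-8859-2 as the wrong charset is an assumption: it happens to map the UTF-8 bytes of é to exactly \u0102\u0160, but the thread does not confirm which encoding was actually applied.

```python
import json

# Made-up sample payload, UTF-8 encoded as a server might send it
raw = '{"Item": {"LastName": "Traoré"}}'.encode("utf-8")

# Decoding with the wrong charset (assumed: ISO-8859-2) reproduces the mojibake
wrong = json.loads(raw.decode("iso-8859-2"))["Item"]["LastName"]

# Decoding the bytes explicitly as UTF-8 gives the expected name
right = json.loads(raw.decode("utf-8"))["Item"]["LastName"]

# Text that is already garbled can be repaired by reversing the wrong decode
repaired = wrong.encode("iso-8859-2").decode("utf-8")

print(json.dumps(wrong))  # -> "Traor\u0102\u0160"
print(json.dumps(right))  # -> "Traor\u00e9"
print(repaired == right)  # -> True
```

With requests, the same idea means decoding `r.content` yourself (for example `r.content.decode("utf-8")`) instead of relying on `r.text`.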