Python requests.get() returns improperly decoded text instead of UTF-8?

野性不改 2020-11-29 22:57

When the server sends the header 'Content-Type: text/html' (with no charset), requests.get() returns improperly decoded text.

However, if

4 answers
  •  盖世英雄少女心
    2020-11-29 23:30

    The "educated guesses" mentioned above are probably just a check of the Content-Type header sent by the server (a rather misleading use of "educated", imho).

    For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (i.e. even though the default for HTML5 is UTF-8).

    For response header Content-Type: text/html; charset=utf-8 the result is UTF-8.

    Luckily for us, requests uses the chardet library (charset_normalizer in newer releases) to detect the encoding from the content itself, and that usually works quite well (attribute requests.Response.apparent_encoding), so you usually want to do:

    import requests

    r = requests.get("https://martin.slouf.name/")
    # override the header-based encoding with the guess detected from the content
    r.encoding = r.apparent_encoding
    # access the decoded text
    r.text
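
To see why the header alone produces garbled text, here is a minimal standard-library sketch of the rule described above. The helper name guess_encoding_from_content_type and the sample string are assumptions made for illustration; this is not the code requests actually runs, only an approximation of the behaviour the answer describes.

```python
from email.message import Message


def guess_encoding_from_content_type(content_type):
    """Hypothetical helper: approximate the header-only rule from the answer."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        # An explicit charset in the header always wins.
        return charset
    if msg.get_content_maintype() == "text":
        # No charset on a text/* type: fall back to the HTML4 default.
        return "ISO-8859-1"
    return None


# Sample body (an assumption): non-ASCII text that the server encoded as UTF-8.
original = "žluťoučký kůň"
body = original.encode("utf-8")

# Header without charset: the bytes are decoded as ISO-8859-1 and come out garbled.
wrong = body.decode(guess_encoding_from_content_type("text/html"))

# Header with charset=utf-8: the bytes are decoded correctly.
right = body.decode(guess_encoding_from_content_type("text/html; charset=utf-8"))

print(repr(wrong))
print(repr(right))
```

Decoding as ISO-8859-1 never raises an error, because every byte maps to some character. That is why the problem shows up as silently garbled text rather than as an exception.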
    
