Encoding issue of a character in utf-8

喜你入骨 submitted on 2019-12-22 10:28:07

Question


I get a link from a web page with the Beautiful Soup library via a.get('href'). The link contains the character ®, but when I read it, it becomes Â®. How can I decode it properly? I have already added # -*- coding: utf-8 -*- at the top of my script.

r = requests.get(url)

soup = BeautifulSoup(r.text)

Answer 1:


Do not use r.text; leave decoding to BeautifulSoup:

soup = BeautifulSoup(r.content)

r.content gives you the response body as bytes, without decoding. r.text, on the other hand, is the response body decoded to Unicode.

What happens is that the server did not include the character set in the response headers. In that case, requests follows HTTP RFC 2616, section 3.7.1: text/* responses are expected to use the ISO-8859-1 (Latin-1) character set by default.
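The header-driven fallback described here can be sketched roughly as follows. This is a simplified, standard-library illustration, not the actual requests source; the helper name guess_encoding_from_content_type is hypothetical.

```python
def guess_encoding_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or a default.

    Hypothetical helper for illustration only: it mimics the rule that
    text/* responses without an explicit charset are treated as ISO-8859-1.
    """
    if not content_type:
        return None

    media_type, _, params = content_type.partition(";")

    # Look for an explicit charset parameter first.
    for param in params.split(";"):
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset" and value.strip():
            return value.strip().strip("'\"")

    # No charset given: text/* falls back to Latin-1.
    if media_type.strip().lower().startswith("text/"):
        return "ISO-8859-1"

    return None


explicit = guess_encoding_from_content_type("text/html; charset=UTF-8")
implicit = guess_encoding_from_content_type("text/html")
```

With these inputs, explicit is "UTF-8" and implicit is "ISO-8859-1"; the second case is the one that applies when a server sends text/html with no charset.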

For your HTML page, that default is wrong, and you got incorrect results; r.text decoded the bytes as Latin-1, resulting in mojibake:

>>> print u'®'.encode('utf8').decode('latin1')
Â®
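The snippet above uses Python 2 syntax. A Python 3 version of the same round trip, plus the reverse step that undoes the damage, might look like this. It is a minimal sketch that assumes the text was garbled by exactly one wrong Latin-1 decode; the variable names are illustrative.

```python
original = "\u00ae"  # the registered sign, ®

# The server sends UTF-8 bytes: two bytes for this one character.
raw = original.encode("utf-8")  # b'\xc2\xae'

# Decoding those bytes as Latin-1 maps each byte to its own character.
garbled = raw.decode("latin-1")  # two characters: U+00C2 (Â) and U+00AE (®)

# Reversing the wrong decode recovers the bytes, which then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The repair step only works when no bytes were lost along the way, so passing r.content to BeautifulSoup is still the more robust fix.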

An HTML page can declare its own encoding, in the form of a <meta> tag in the HTML header. BeautifulSoup will use that tag and decode the bytes for you.

Even if the <meta> header tag is missing, BeautifulSoup includes other methods to auto-detect encodings.
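To make the idea concrete, here is a rough standard-library sketch of decoding raw bytes by first looking for a <meta> charset declaration and then falling back to a default. It is a crude stand-in for illustration, not BeautifulSoup's actual detection logic; the name decode_html, the 1024-byte window, and the UTF-8 fallback are all assumptions.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def decode_html(raw, fallback="utf-8"):
    """Decode HTML bytes using a declared <meta> charset when present."""
    # Declarations normally appear near the top, so only scan the start.
    match = META_CHARSET.search(raw[:1024])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: decode leniently with the fallback.
        return raw.decode(fallback, errors="replace")


page = b'<html><head><meta charset="utf-8"></head><body><a href="/a\xc2\xaeb">x</a></body></html>'
text = decode_html(page)
```

Here text contains the single character ® in the href rather than the two-character mojibake, because the bytes were decoded with the charset the page itself declared.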



Source: https://stackoverflow.com/questions/24790258/encoding-issue-of-a-character-in-utf-8
