WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER. With Requests and BeautifulSoup

Submitted by 无人久伴 on 2020-01-03 16:00:32

Question


I had this web scraping code working a few minutes ago, but now I get this warning and the text comes back badly encoded. Since the request no longer seems to return usable HTML, BeautifulSoup returns None when I search for the contents of a tag. What is going wrong here? I searched for this encoding problem but couldn't find a clear answer.

import requests
from bs4 import BeautifulSoup


url = 'http://finance.yahoo.com/q?s=aapl&fr=uh3_finance_web&uhb=uhb2'

data = requests.get(url)
soup = BeautifulSoup(data.content).text
print(data)

Here are the results:

0.0 seconds
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
[the same warning and response pair is printed 10 times in total]
{}

Process finished with exit code 0

Answer 1:


The following BeautifulSoup constructor call worked for me:

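# Open in binary mode: from_encoding applies only to bytes and is ignored for already-decoded text
soup = BeautifulSoup(open(html_path, 'rb'), "html.parser", from_encoding="iso-8859-1")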
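
For illustration, here is a minimal sketch of the same idea using in-memory bytes instead of a file. The sample markup is made up for this example and is not taken from the question; it only assumes that the page's bytes really are ISO-8859-1.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: "café" encoded as ISO-8859-1, where é is the single byte 0xE9.
raw = "<p>café</p>".encode("iso-8859-1")

# Because the markup is bytes, from_encoding decides how it is decoded.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")
text = soup.p.get_text()
detected = soup.original_encoding

print(text)
print(detected)
```

If the text still looks wrong with this approach, the page is probably in a different encoding, and the value passed to from_encoding would need to change accordingly.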



Answer 2:


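from urllib.request import urlopen
from bs4 import BeautifulSoup
response = urlopen(notiurl)
# Decode the raw bytes explicitly so BeautifulSoup receives text, not bytes
html = response.read().decode(encoding="iso-8859-1")
soup = BeautifulSoup(html, 'html.parser')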

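To check which encoding BeautifulSoup detected, print soup.original_encoding. Note that it is None when the markup was already decoded to a string, as above; it is only set when BeautifulSoup is given bytes.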

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
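
To show where the REPLACEMENT CHARACTER comes from, here is a small standard-library sketch that needs no network access. The sample string is an assumption chosen for this example, not data from the question.

```python
# Hypothetical sample: "café" encoded as ISO-8859-1 gives b'caf\xe9'.
raw = "café".encode("iso-8859-1")

# The byte 0xE9 on its own is not valid UTF-8, so a lenient decode
# substitutes U+FFFD (REPLACEMENT CHARACTER) for it.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were actually written in recovers the text.
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

One caveat: ISO-8859-1 maps every possible byte to some character, so decoding with it never fails. That silences the warning, but if the page is really UTF-8 the result will be garbled text rather than an error.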



Source: https://stackoverflow.com/questions/30110289/warningrootsome-characters-could-not-be-decoded-and-were-replaced-with-replac
