Question
I have watched a video that teaches how to use BeautifulSoup and requests to scrape a website. Here's the code:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

pages_to_scrape = 1
for i in range(1, pages_to_scrape + 1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    # print(soup.prettify())
    for j in soup.findAll('p', class_='price_color'):
        price = j.getText()
        print(price)
The code is working well, but in the results I noticed a weird character before the currency symbol, and when checking the HTML source I didn't find that character. Any ideas why this character appears, and how it can be fixed? Is using replace enough, or is there a better approach?
Answer 1:
It seems to me you have framed the question slightly wrong. I assume you are on Windows, where your terminal/IDLE uses the default encoding of cp1252, but the page you are dealing with is UTF-8. You have to configure your terminal/IDLE to use UTF-8.
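To make the mismatch concrete, here is a minimal sketch (the price value is illustrative) that reproduces the mangled output by writing UTF-8 bytes and then decoding them as a cp1252 console would:

```python
import io

text = "£51.77"  # what the page actually contains, sent as UTF-8 on the wire

# Write through a UTF-8 text stream to capture the raw bytes a server would send
buf = io.BytesIO()
wrapper = io.TextIOWrapper(buf, encoding="utf-8")
wrapper.write(text)
wrapper.flush()
raw = buf.getvalue()  # b'\xc2\xa351.77'

# A console configured for cp1252 interprets those same bytes as:
mangled = raw.decode("cp1252")
print(mangled)  # Â£51.77 -- the "weird character" before the symbol
```

On Python 3.7+, `sys.stdout.reconfigure(encoding="utf-8")` is one way to switch the interpreter's output stream to UTF-8, assuming the console itself can render it.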
import requests
from bs4 import BeautifulSoup


def main(url):
    with requests.Session() as req:
        for item in range(1, 10):
            r = req.get(url.format(item))
            print(r.url)
            soup = BeautifulSoup(r.content, 'html.parser')
            goal = [(x.h3.a.text, x.select_one("p.price_color").text)
                    for x in soup.select("li.col-xs-6")]
            print(goal)


main("http://books.toscrape.com/catalogue/page-{}.html")
- Try to always follow the DRY principle, which means "Don't Repeat Yourself".
- Since you are dealing with the same host, you should maintain one session instead of repeatedly opening a TCP socket stream, closing it, and opening it again. That can get your requests blocked and treated as a DDOS attack once the TCP flags are captured by the back-end. Imagine opening your browser, loading a website, closing the browser, and repeating the cycle!
- Python functions usually look nicer and are easier to read than code that reads like journal text.
Notes: note the usage of range(), the {} format string, and CSS selectors.
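The points above can be sketched as a small refactor. The function names `build_urls` and `scrape` are illustrative, not from the original answer:

```python
import requests

BASE = "http://books.toscrape.com/catalogue/page-{}.html"


def build_urls(base, pages):
    # range() plus str.format keeps the URL template in one place (DRY)
    return [base.format(i) for i in range(1, pages + 1)]


def scrape(urls):
    # A single Session reuses the same TCP connection for every request
    # to the host instead of opening and closing a socket each time.
    with requests.Session() as session:
        for url in urls:
            response = session.get(url)
            yield response.url, response.status_code


# Usage (performs real HTTP requests):
# for url, status in scrape(build_urls(BASE, 2)):
#     print(url, status)
```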
Answer 2:
You could use page.content.decode('utf-8') instead of page.text. As people in the comments said, it is an encoding issue: .content returns the HTML as bytes, which you can then convert to a string with the right encoding using .decode('utf-8'), whereas .text returns a string with the wrong encoding (maybe cp1252). The final code may look like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

pages_to_scrape = 1
pages = []  # You forgot this line
for i in range(1, pages_to_scrape + 1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs4(page.content.decode('utf-8'), 'html.parser')  # Replace .text with .content.decode('utf-8')
    # print(soup.prettify())
    for j in soup.findAll('p', class_='price_color'):
        price = j.getText()
        print(price)
This should hopefully work.
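To see the .text vs .content difference without hitting the network, one can hand-build a requests.Response and simulate requests falling back to the wrong encoding. Setting the private `_content` attribute here is purely for illustration; in real use the bytes and encoding come from the HTTP response:

```python
import requests

r = requests.Response()
r._content = "£51.77".encode("utf-8")  # the raw bytes the server sent (illustration only)
r.status_code = 200
r.encoding = "ISO-8859-1"  # simulate requests guessing the encoding from the headers

print(r.text)                     # Â£51.77 -- decoded with the wrong codec
print(r.content.decode("utf-8"))  # £51.77  -- decoded explicitly, as suggested above
```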
P.S.: Sorry for directly writing the answer, I don't have enough reputation to write in comments :D
Source: https://stackoverflow.com/questions/65027293/weird-character-not-exists-in-html-source-python-beautifulsoup