Weird character that does not exist in the HTML source (Python, BeautifulSoup)

Question

I watched a video that teaches how to use BeautifulSoup and requests to scrape a website. Here's the code:

from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

pages_to_scrape = 1

for i in range(1,pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    #print(soup.prettify())
for j in soup.findAll('p', class_='price_color'):
    price=j.getText()
    print(price)

The code is working well. But in the results I noticed a weird character before the euro symbol, and when I checked the HTML source, I didn't find that character. Any ideas why this character appears, and how can it be fixed? Is using replace enough, or is there a better approach?

Answer 1:

The question could be clearer, but I assume you are on Windows, where your terminal or IDLE uses the default encoding, cp1252.

The page, however, is UTF-8, so you need to configure your terminal or IDLE to use UTF-8. Note that the code below also passes r.content (bytes) to BeautifulSoup rather than r.text.

import requests
from bs4 import BeautifulSoup


def main(url):
    with requests.Session() as req:
        for item in range(1, 10):
            r = req.get(url.format(item))
            print(r.url)
            soup = BeautifulSoup(r.content, 'html.parser')
            goal = [(x.h3.a.text, x.select_one("p.price_color").text)
                    for x in soup.select("li.col-xs-6")]
            print(goal)


main("http://books.toscrape.com/catalogue/page-{}.html")

Always try to follow the DRY principle, which means "Don't Repeat Yourself".
Since you are requesting from the same host, keep one session open instead of opening a TCP connection, closing it, and opening it again for every request. Constant reconnecting can get your requests blocked, because the back end may see the repeated TCP handshakes and treat them as a DDoS attack. Imagine opening your browser, loading a website, closing the browser, and repeating that cycle!
Code wrapped in functions usually looks nicer and is easier to read than code written like running journal text.

Notes: pay attention to the use of range(), the {} format string, and CSS selectors.
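
To make the cp1252 versus UTF-8 mismatch mentioned above concrete, here is a minimal offline sketch of how a single stray character can appear. It assumes the price starts with a pound sign and uses a made-up price; the helper name misdecode is hypothetical and not part of requests or BeautifulSoup. ascii() is used for printing so the output does not depend on your terminal's encoding.

```python
def misdecode(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


# Assumption: a pound sign (U+00A3) followed by a made-up price.
price = "\u00a351.77"
garbled = misdecode(price)

# In UTF-8 the pound sign is two bytes (0xC2 0xA3). cp1252 reads each byte
# as its own character, so an extra U+00C2 shows up before the symbol.
print(ascii(price))
print(ascii(garbled))
print(len(garbled) - len(price))  # number of extra characters
```

With a different currency symbol the garbled prefix would look different, but the mechanism would be the same: multi-byte UTF-8 sequences read one byte at a time.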

Answer 2:

You could use page.content.decode('utf-8') instead of page.text. As people in the comments said, it is an encoding issue: .content returns the HTML as bytes, which you can convert to a string with the right encoding using .decode('utf-8'), whereas .text returns a string decoded with the encoding requests inferred from the headers, which is wrong here (maybe cp1252; requests falls back to ISO-8859-1 when a text response declares no charset). The final code may look like this:

from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

pages_to_scrape = 1
pages = [] # You forgot this line

for i in range(1,pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.content.decode('utf-8'), 'html.parser') # Replace .text with .content.decode('utf-8')
    #print(soup.prettify())
    # Indented into the page loop so every page's prices are printed, not only the last page's
    for j in soup.findAll('p', class_='price_color'):
        price = j.getText()
        print(price)

This should hopefully work.
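
If you would rather not hard-code the encoding, one possible variant is to honor a charset when the Content-Type header declares one and fall back to UTF-8 otherwise. The sketch below is standard-library only and works on a stand-in byte string instead of a real page.content. The helper names charset_from_header and decode_body, the sample HTML, and the assumption that the server sends a plain text/html header without a charset are all mine, not from the answer above.

```python
from email.message import Message


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes, using the declared charset if there is one."""
    charset = charset_from_header(content_type) or default
    return raw.decode(charset, errors="replace")


# Stand-in for page.content: UTF-8 bytes with a pound sign and a made-up price.
raw = "<p class='price_color'>\u00a351.77</p>".encode("utf-8")

# Assumed header without a charset, so the UTF-8 default is used.
print(ascii(decode_body(raw, "text/html")))

# For comparison: the same bytes decoded as ISO-8859-1 gain a stray U+00C2.
print(ascii(raw.decode("iso-8859-1")))
```

In the scraper itself, the result of decode_body(page.content, page.headers.get('Content-Type', '')) could be passed to BeautifulSoup in place of page.text. Treat that as an untested suggestion rather than a drop-in fix.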

P.S.: Sorry for posting this directly as an answer; I don't have enough reputation to write comments :D

Source: https://stackoverflow.com/questions/65027293/weird-character-not-exists-in-html-source-python-beautifulsoup

Tags

python

beautifulsoup

python-requests