Why is HTML returned by requests different from the real page HTML?

Question

Hi friends, I'm trying to scrape a webpage to get some data to work with. One of the pages I want to scrape is https://www.etoro.com/people/sparkliang/portfolio. The problem comes when I fetch the page using:

import requests

h=requests.get('https://www.etoro.com/people/sparkliang/portfolio')
h.content

This gives me HTML that is completely different from what I see in the browser: it contains a lot of extra meta tags and is missing the text and elements I am looking for.

For example, imagine I want to scrape:

<p ng-if=":: item.IsStock" class="i-portfolio-table-hat-fullname ng-binding ng-scope">Shopify Inc.</p>

I use a command like this:

    from bs4 import BeautifulSoup

    import requests

    html_text = requests.get('https://www.etoro.com/people/sparkliang/portfolio').text
    print(html_text)

    soup = BeautifulSoup(html_text,'lxml')

    job = soup.find('p', class_='i-portfolio-table-hat-fullname ng-binding ng-scope').text

I expected this to return Shopify Inc., but it doesn't, because the HTML I get from the page through the requests library is completely different.

I want to know how to get the original HTML from the web page. If you press Ctrl+F and search for a keyword like Shopify Inc, it isn't even in the HTML I get from the requests library. Thank you for reading.

Answer 1:

This happens because the page builds its DOM elements dynamically with JavaScript. requests only downloads the HTML the server sends and never executes that JavaScript, so the elements you are looking for are not in the response. Instead, you should use Selenium with a WebDriver and wait for the elements to be created before scraping.
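To make the difference concrete, here is a minimal sketch that uses only the standard library and no network access. The `server_html` string is a hypothetical stand-in for what a JavaScript-driven page might send before any script runs (it is not eToro's actual response), and `ClassFinder` and `find_texts` are illustrative helper names. The `rendered_html` string is the element quoted in the question.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collects the text of every <tag> whose class attribute contains css_class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.matches = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._inside = True
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.matches[-1] += data


def find_texts(html, tag, css_class):
    """Return the text of all matching elements found in an HTML string."""
    finder = ClassFinder(tag, css_class)
    finder.feed(html)
    finder.close()
    return finder.matches


# Hypothetical: an empty application shell, as sent before any JavaScript runs.
server_html = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><div id="app"></div><script src="app.js"></script></body></html>'
)

# The element from the question, which exists only after the browser renders the page.
rendered_html = (
    '<p ng-if=":: item.IsStock" '
    'class="i-portfolio-table-hat-fullname ng-binding ng-scope">Shopify Inc.</p>'
)

server_matches = find_texts(server_html, "p", "i-portfolio-table-hat-fullname")
rendered_matches = find_texts(rendered_html, "p", "i-portfolio-table-hat-fullname")
print(server_matches)    # nothing to find in the unrendered shell
print(rendered_matches)  # the element is present in the rendered markup
```

Running the same kind of check against the text returned by requests is a quick way to confirm that the missing elements are added by JavaScript rather than hidden by a parsing mistake.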

Download the ChromeDriver executable that matches your Chrome version. If you place it in the same folder as your script, you can run:

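import os

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920,1080")
chrome_options.add_argument("--headless")

# Path to the ChromeDriver executable. CHANGE THIS IF NOT SAME FOLDER.
chrome_driver = os.path.join(os.getcwd(), "chromedriver.exe")
# Selenium 4 takes the driver path through a Service object;
# the executable_path argument of webdriver.Chrome is no longer accepted.
driver = webdriver.Chrome(service=Service(executable_path=chrome_driver), options=chrome_options)

url = 'https://www.etoro.com/people/sparkliang/portfolio'
driver.get(url)

# Wait up to 20 seconds for the JavaScript-rendered elements to appear in the DOM.
jobs = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.i-portfolio-table-hat-fullname'))
)
for job in jobs:
    print(job.text)

# Read the page source only after the wait, so it contains the rendered elements.
html_text = driver.page_source

driver.quit()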

Here we use Selenium's WebDriverWait together with expected_conditions (EC) to make sure the elements exist before we try to read the information we're looking for.
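Conceptually, an explicit wait like this polls a condition until it succeeds or a timeout expires. The following is a simplified illustration of that idea in plain Python, not Selenium's actual implementation. `wait_until` and `FakePage` are hypothetical names, and `FakePage` merely simulates a page whose elements show up on the third lookup.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose elements appear on the third lookup."""

    def __init__(self):
        self.lookups = 0

    def find_names(self):
        self.lookups += 1
        return ["Facebook", "Apple"] if self.lookups >= 3 else []


page = FakePage()
names = wait_until(page.find_names, timeout=2.0, poll_interval=0.01)
print(names)
```

The first two lookups return an empty list, so the loop sleeps and retries. The third lookup returns the names and the wait ends immediately, without using up the rest of the timeout.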

Output:

Facebook
Apple
Walt Disney
Alibaba
JD.com
Mastercard
...

Source: https://stackoverflow.com/questions/65186906/why-is-html-returned-by-requests-different-from-the-real-page-html

Tags

python

web-scraping

python-requests