Question:
So with my code below I want to open an apartment website URL and scrape the webpage. The only issue is that Beautiful Soup isn't waiting until the entire webpage is rendered. The apartments aren't rendered in the html until they are loaded on the page, which takes a few seconds. How do I fix this?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://xxxxx.com/properties/?sort=latest'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", {"class": "grid-item"})
# len(containers) is 0 because the listings are injected by JavaScript after the initial HTML arrives
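The root cause is that urlopen only returns the initial HTML document; the listings are added afterwards by JavaScript, which urllib never executes. Once you obtain the fully rendered HTML (by either approach in the answers), the parsing itself works as expected. A minimal sketch against a hypothetical rendered snippet (the markup below is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fully rendered markup, invented for illustration
rendered_html = """
<div class="grid-item">Apartment A</div>
<div class="grid-item">Apartment B</div>
"""

page_soup = BeautifulSoup(rendered_html, "html.parser")
containers = page_soup.find_all("div", {"class": "grid-item"})
print(len(containers))  # 2
```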
Answer 1:
If you want to wait for the page to fully load its data, consider using Selenium; in your case it could look like this:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
url = "<URL>"
chrome_options = Options()
chrome_options.add_argument("--headless")  # run the browser in the background (no visible window)
with Chrome(options=chrome_options) as browser:
    browser.get(url)
    html = browser.page_source

page_soup = BeautifulSoup(html, "html.parser")
containers = page_soup.find_all("div", {"class": "grid-item"})
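Even with Selenium, browser.page_source can be captured before the JavaScript has finished injecting the listings; Selenium's WebDriverWait is the usual answer, but a renderer-agnostic fallback is to simply re-fetch and re-parse until the target elements appear. A minimal sketch, assuming a fetch_html callable that returns the page's current HTML (wait_for_items and fake_fetch are hypothetical names, not part of Selenium or Beautiful Soup):

```python
import time
from bs4 import BeautifulSoup

def wait_for_items(fetch_html, selector="div.grid-item", retries=5, delay=0.5):
    """Re-fetch and re-parse until the selector matches, or give up.

    fetch_html is any zero-argument callable returning the current page HTML,
    e.g. lambda: browser.page_source with Selenium.
    """
    for _ in range(retries):
        soup = BeautifulSoup(fetch_html(), "html.parser")
        items = soup.select(selector)
        if items:
            return items
        time.sleep(delay)
    return []

# Illustration with a fake fetcher whose content "loads" on the third call
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return '<div class="grid-item">A</div>' if calls["n"] >= 3 else "<div></div>"

items = wait_for_items(fake_fetch, delay=0)
print(len(items))  # 1
```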
Answer 2:
I'm happy with the requests_html library. It renders dynamic HTML for you and is much simpler to set up than Selenium.
from requests_html import HTMLSession
import pyppdf.patch_pyppeteer  # patches pyppeteer's Chromium download so render() works
from bs4 import BeautifulSoup
url = 'https://xxxxx.com/properties/?sort=latest'
session = HTMLSession()
resp = session.get(url)
resp.html.render()
html = resp.html.html
page_soup = BeautifulSoup(html, 'html.parser')
containers = page_soup.find_all("div", {"class": "grid-item"})
Source: https://stackoverflow.com/questions/58773479/beautiful-soup-not-waiting-until-page-is-fully-loaded