Question:
So with my code below I want to open an apartment website URL and scrape the webpage. The only issue is that Beautiful Soup isn't waiting until the entire webpage is rendered. The apartments aren't rendered in the html until they are loaded on the page, which takes a few seconds. How do I fix this?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://xxxxx.com/properties/?sort=latest'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", {"class": "grid-item"})
# len(containers) is 0 because the listings are injected by JavaScript after the initial HTML arrives
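The root cause is that urlopen only returns the initial HTML document; the listings are added afterwards by JavaScript, which urllib never executes. Once you obtain the fully rendered HTML (by either approach in the answers), the parsing itself works as expected. A minimal sketch against a hypothetical rendered snippet (the markup below is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fully rendered markup, invented for illustration
rendered_html = """
<div class="grid-item">Apartment A</div>
<div class="grid-item">Apartment B</div>
"""

page_soup = BeautifulSoup(rendered_html, "html.parser")
containers = page_soup.find_all("div", {"class": "grid-item"})
print(len(containers))  # 2
```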
Answer 1:
If you want to wait for the page to fully load its data, consider using Selenium; in your case it could look like this:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
url = "<URL>"
chrome_options = Options()
chrome_options.add_argument("--headless")  # run the browser in the background (no visible window)
with Chrome(options=chrome_options) as browser:
    browser.get(url)
    html = browser.page_source

page_soup = BeautifulSoup(html, "html.parser")
containers = page_soup.find_all("div", {"class": "grid-item"})
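Even with Selenium, browser.page_source can be captured before the JavaScript has finished injecting the listings; Selenium's WebDriverWait is the usual answer, but a renderer-agnostic fallback is to simply re-fetch and re-parse until the target elements appear. A minimal sketch, assuming a fetch_html callable that returns the page's current HTML (wait_for_items and fake_fetch are hypothetical names, not part of Selenium or Beautiful Soup):

```python
import time
from bs4 import BeautifulSoup

def wait_for_items(fetch_html, selector="div.grid-item", retries=5, delay=0.5):
    """Re-fetch and re-parse until the selector matches, or give up.

    fetch_html is any zero-argument callable returning the current page HTML,
    e.g. lambda: browser.page_source with Selenium.
    """
    for _ in range(retries):
        soup = BeautifulSoup(fetch_html(), "html.parser")
        items = soup.select(selector)
        if items:
            return items
        time.sleep(delay)
    return []

# Illustration with a fake fetcher whose content "loads" on the third call
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return '<div class="grid-item">A</div>' if calls["n"] >= 3 else "<div></div>"

items = wait_for_items(fake_fetch, delay=0)
print(len(items))  # 1
```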
Answer 2:
I'm happy with the requests_html library. It renders dynamic HTML for you and is much simpler to set up than Selenium.
from requests_html import HTMLSession
import pyppdf.patch_pyppeteer  # patches pyppeteer's Chromium download so render() works
from bs4 import BeautifulSoup
url = 'https://xxxxx.com/properties/?sort=latest'
session = HTMLSession()
resp = session.get(url)
resp.html.render()
html = resp.html.html
page_soup = BeautifulSoup(html, 'html.parser')
containers = page_soup.find_all("div", {"class": "grid-item"})
Source: https://stackoverflow.com/questions/58773479/beautiful-soup-not-waiting-until-page-is-fully-loaded