Beautiful Soup not waiting until page is fully loaded

Submitted by 末鹿安然 on 2020-06-27 04:14:47

Question


With the code below, I want to open an apartment website URL and scrape the page. The only issue is that Beautiful Soup isn't waiting until the entire page is rendered. The apartments don't appear in the HTML until they are loaded on the page, which takes a few seconds. How do I fix this?

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://xxxxx.com/properties/?sort=latest'

# Download the page's HTML
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.findAll("div", {"class": "grid-item"})
# containers is empty since the contents haven't been loaded yet!

Answer 1:


If you want to wait for the page to fully load its data, consider using Selenium. In your case it could look like this:

from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

url = "<URL>"

chrome_options = Options()
chrome_options.add_argument("--headless")  # Run the browser in the background, without a window

# The context manager quits the browser when the block exits
with Chrome(options=chrome_options) as browser:
    browser.get(url)
    html = browser.page_source

page_soup = BeautifulSoup(html, 'html.parser')
containers = page_soup.findAll("div", {"class": "grid-item"})



Answer 2:


I'm happy with the requests_html library. It renders dynamic HTML for you and is much simpler to set up than Selenium.

from requests_html import HTMLSession
import pyppdf.patch_pyppeteer  # imported for its side effect of patching pyppeteer
from bs4 import BeautifulSoup

url = 'https://xxxxx.com/properties/?sort=latest'

session = HTMLSession()

resp = session.get(url)
resp.html.render()  # runs the page's JavaScript in headless Chromium
html = resp.html.html

page_soup = BeautifulSoup(html, 'html.parser')

containers = page_soup.find_all("div", {"class": "grid-item"})
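
If the listings still have not appeared after render(), it may help to give the page time to settle. A minimal sketch, assuming a few seconds is enough for this site: render() accepts a sleep argument (seconds to pause after the initial render) and a timeout argument. The helper names and the values 3 and 20 are illustrative guesses, not part of the original answer.

```python
def render_page(session, url, settle_seconds=3, timeout=20):
    """Fetch url, render its JavaScript, and return the resulting HTML.

    Hypothetical helper. `session` is expected to be a
    requests_html.HTMLSession (or an object with the same interface).
    """
    resp = session.get(url)
    resp.raise_for_status()
    # sleep: pause after the initial render so late-loading content can arrive
    resp.html.render(sleep=settle_seconds, timeout=timeout)
    return resp.html.html


def fetch_rendered_html(url, settle_seconds=3, timeout=20):
    """Create a session, render the page, and always close the session."""
    # Imported here so render_page can be exercised without requests_html
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        return render_page(session, url, settle_seconds, timeout)
    finally:
        session.close()
```

The returned HTML can be handed to BeautifulSoup the same way as in the answer's code. A fixed pause is simpler than an element-based wait, but it is also a guess: too short and the content is missing, too long and every request is slower.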


Source: https://stackoverflow.com/questions/58773479/beautiful-soup-not-waiting-until-page-is-fully-loaded
