Unable to get actual Markup from a page with BeautifulSoup

Submitted by 爷,独闯天下 on 2019-12-24 03:34:15

Question


I am trying to scrape this URL with a combination of BeautifulSoup and Selenium:

http://starwood.ugc.bazaarvoice.com/3523si-en_us/115/reviews.djs?format=embeddedhtml&page=2&scrollToTop=true

I have tried this code:

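active_review_page_html = browser.page_source
# strip the backslashes that escape the embedded markup
active_review_page_html = active_review_page_html.replace('\\', "")
# name the parser explicitly so the result does not depend on what is installed
hotel_page_soup = BeautifulSoup(active_review_page_html, "html.parser")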
print(hotel_page_soup)

But what it returns is data like this:

;<span class="BVRRReviewText">Hotel accommodations and staff were fine ....

But I need to select that span from the page with:

for review_div in hotel_page_soup.select("span.BVRRReviewText"):

How can I get the real markup from that URL?


Answer 1:


First of all, you are giving us the wrong link: instead of the actual page you are trying to scrape, you link to a JavaScript file that takes part in loading the page, which would be an unnecessary challenge to parse.

Secondly, you don't need BeautifulSoup in this case: Selenium itself is good at locating elements and extracting their text or attributes, so there is no need for an extra step.

Here's a working example using the actual page with reviews you want to get:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get('http://www.starwoodhotels.com/sheraton/property/reviews/index.html?propertyID=115&language=en_US')

# wait for the reviews to load
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "span.BVRRReviewText")))

# get reviews
for review_div in driver.find_elements(By.CSS_SELECTOR, "span.BVRRReviewText"):
    print(review_div.text)
    print("---")

driver.quit()  # end the browser session, not just the current window

Prints:

This is not a low budget hotel . Yet the hotel offers no amenities. Nothing and no WiFi. In fact, you block the wifi that comes with my celluar plan. I am a part of 2 groups that are loyal to the Sheraton, Alabama A&M and the 9th Episcopal District AMEChurch but the Sheraton is not loyal to us.
---
We are a company that had (5) guest rooms at the hotel. Despite having a credit card on file for room and tax charges, my guest was charged the entire amount to her personal credit card. It has taken me (5) PHONE CALLS and my own time and energy to get this bill reversed. I guess leaving a message with information and a phone number numerous times is IGNORED at this hotel. You can guarantee that we will not return with our business. YOu may thank Kimerlin or Kimberly in your accounting office for her lack of personal service and follow through for the lost business in the future.
---
...

I've intentionally left you to handle pagination - let me know if you have difficulties.
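
As one possible starting point for pagination, here is a minimal sketch that rewrites the page parameter of a URL. It assumes that the page query parameter seen in the question's URL selects the review page. Whether the main reviews page accepts a similar parameter is not confirmed here, and the helper name with_page is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# URL taken from the question; the meaning of "page" is an assumption.
QUESTION_URL = (
    "http://starwood.ugc.bazaarvoice.com/3523si-en_us/115/reviews.djs"
    "?format=embeddedhtml&page=2&scrollToTop=true"
)


def with_page(url, page):
    """Return url with its "page" query parameter set to the given page."""
    parts = urlsplit(url)
    # keep_blank_values keeps parameters that carry no value
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # replace an existing "page" value in place so parameter order is kept
    updated = [(k, str(page)) if k == "page" else (k, v) for k, v in pairs]
    if not any(k == "page" for k, _ in pairs):
        updated.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(updated)))


if __name__ == "__main__":
    # build the URLs for the first three pages without fetching anything
    for page_number in range(1, 4):
        print(with_page(QUESTION_URL, page_number))
```

Each generated URL could then be passed to driver.get(), followed by the same wait-and-extract steps as in the example above.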



Source: https://stackoverflow.com/questions/27134612/unable-to-get-actual-markup-from-a-page-with-beautifulsoup
