Extracting HTML content from a search page using Beautiful Soup with Python

有些话、适合烂在心里 posted on 2020-01-11 11:25:08

Question


I'm trying to get some hotel info from booking.com using Beautiful Soup. I need to get certain info for all the accommodations in Spain. This is the search URL:

https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0&postcard=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&src_elem=sb&ss=Spain&ss_all=0&ss_raw=spain&ssb=empty&sshis=0&order=popularity

When I inspect an accommodation on the results page with the developer tools, this is the tag I need to search for:

<a class="hotel_name_link url" href="&#10;/hotel/es/aran-la-abuela.html?label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM;sid=1677838e3fc7c26577ea908d40ad5faf;ucfs=1;srpvid=b4980e34f6e50017;srepoch=1514167274;room1=A%2CA;hpos=1;hapos=1;dest_type=country;dest_id=197;srfid=198499756e07f93263596e1640823813c2ee4fe1X1;from=searchresults&#10;;highlight_room=#hotelTmpl" target="_blank" rel="noopener">
<span class="sr-hotel__name
" data-et-click="
customGoal:YPNdKNKNKZJUESUPTOdJDUFYQC:1
">
Hotel Spa Aran La Abuela
</span>
<span class="invisible_spoken">Opens in new window</span>
</a>

This is my Python code:

import requests
from bs4 import BeautifulSoup


def init_BeautifulSoup():
    global page, soup
    page= requests.get("https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0&postcard=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&src_elem=sb&ss=Spain&ss_all=0&ss_raw=spain&ssb=empty&sshis=0&order=popularity")
    soup = BeautifulSoup(page.content, 'html.parser')


def get_spain_accomodations():
    global accomodations
    accomodations = soup.find_all(class_="hotel_name_link.url")

But when I run the code and print the accomodations variable, it outputs an empty list ([]). I then printed the soup object and realized that the parsed HTML is very different from what I see in Chrome's developer tools, which is why the soup object can't find the class "hotel_name_link.url".

What's going on?


Answer 1:


JavaScript modifies the page after it loads. page.content therefore gives you the HTML as the server sent it, before any JavaScript has run.

You can use Selenium to render the JavaScript-generated content. After the page loads, driver.page_source returns the page source as modified by JavaScript, and you can pass that to BeautifulSoup.

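from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def get_page(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait up to 10 seconds for the element to appear; 'h1' is only a placeholder.
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h1')))
        return driver.page_source
    except TimeoutException:
        print('Page timed out.')
        return None
    finally:
        # Always close the browser, whether or not the wait succeeded.
        driver.quit()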

def init_BeautifulSoup():
    global page, soup
    page = get_page('your-url')
    if page is None:
        # get_page timed out, so there is nothing to parse
        soup = None
        return
    soup = BeautifulSoup(page, 'html.parser')
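
Once soup holds the rendered HTML, the lookup from the question still needs one change: class_="hotel_name_link.url" asks for a single class literally named hotel_name_link.url, whereas the tag carries two separate classes, hotel_name_link and url. Here is a minimal sketch, assuming the rendered markup still looks like the snippet in the question. The sample HTML is abbreviated from that snippet with the href shortened, and extract_hotels is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the markup shown in the question (href shortened).
SAMPLE_HTML = """
<a class="hotel_name_link url" href="
/hotel/es/aran-la-abuela.html
" target="_blank" rel="noopener">
<span class="sr-hotel__name
">
Hotel Spa Aran La Abuela
</span>
<span class="invisible_spoken">Opens in new window</span>
</a>
"""


def extract_hotels(html):
    """Return (name, href) pairs for every hotel link in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    hotels = []
    # "a.hotel_name_link.url" is a CSS selector: an <a> tag carrying both classes.
    for link in soup.select("a.hotel_name_link.url"):
        name_tag = link.select_one("span.sr-hotel__name")
        if name_tag is None:
            continue
        # The markup pads both the name and the href with newlines, so strip them.
        hotels.append((name_tag.get_text(strip=True), link["href"].strip()))
    return hotels


if __name__ == "__main__":
    print(extract_hotels(SAMPLE_HTML))
```

soup.find_all(class_="hotel_name_link") should also work, because class_ matches against each individual class on a tag. The selector form just makes the two-class requirement explicit.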

EDIT:

You'll need to change one thing in get_page above.

The line WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h1'))) makes the driver wait explicitly until the specified element is present on the page, and throws TimeoutException if the element has not appeared within the timeout you give (I've used 10 seconds).

That was just an example. You need to find an element on the loaded page that is not present before the JavaScript runs, and use it in place of (By.TAG_NAME, 'h1').

You can do this by inspecting elements after the page has loaded and checking whether each one also exists in the HTML of the page source.

Instead of By.TAG_NAME, you can use any of the following, depending on your needs: ID, NAME, CLASS_NAME, CSS_SELECTOR, XPATH.



Source: https://stackoverflow.com/questions/47965265/extracting-html-content-from-a-search-page-using-beautiful-soup-with-python
