BeautifulSoup Scraping: loading div instead of the content

﹥>﹥吖頭↗ submitted on 2020-01-06 19:55:47

Question


Noob here. I'm trying to scrape search results from this website: http://www.mastersportal.eu/search/?q=di-4|lv-master&order=relevance

I'm using Python's BeautifulSoup:

import csv
import requests
from bs4 import BeautifulSoup  # Beautiful Soup 4 (package name: beautifulsoup4)

for numb in ('0', '69'):
    url = ('http://www.mastersportal.eu/search/?q=ci-30,11,10,3,4,8,9,14,15,16,17,34,1,19|di-4|lv-master|rv-1&start=' + numb + '0&order=tuition_eea&direction=asc')
    response = requests.get(url)
    html = response.content

    # Parse the raw HTML exactly as the server returned it
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('div', attrs={'id': 'StudySearchResults'})

    # Collect the text of every <h3> heading inside the results container
    lista = []
    for h3 in table.find_all('h3'):
        lista.append(h3.string)
print(table.prettify())

I want to get clean data with the basic information about each Master's programme (for now just the name). The URL I'm using here is for a filtered search on the website, and the loop that goes through the pages should be fine.

However, the results are:

<div id="StudySearchResults">
  <div style="display:none" id="TrackingSearchValue" class="TrackingSearchValue" data-search=""></div>
  <div style="display:none" id="SearchViewEvent" class="TrackingEvent TrackingNoLocation" data-type="srch" data-action="view" data-id=""></div>
  <div id="StudySearchResultsStudies" class="TrackingLinkedList" data-start="" data-list-type="study" data-type="rslts">
    <!-- Wait pane, just here to make sure there is no white page -->
    <div id="WaitPane" class="WaitPane">
      <img src="http://www.mastersportal.eu/Modules/Results/Resources/Throbber.gif" />
      <span>Loading search results...</span>
    </div>
  </div>
</div>

Why is only the loading div displayed instead of the content? From what I've read, it seems to be related to the way the website loads its data with JavaScript. Is there something like an AJAX request for Python, or any other way to tell the scraper to wait for the page to load?


Answer 1:


You have basically answered your own question. Beautiful Soup only parses HTML: your script sees exactly what the server returns for a specific URL (here downloaded with requests), and nothing that JavaScript adds to the page afterwards in a browser.
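One way to confirm that the server response itself lacks the results is to probe the raw HTML for result headings and for the loading placeholder. The following is a minimal sketch using only Python's standard-library html.parser; the names ResultProbe and probe are illustrative, and the sample string is a shortened copy of the output shown in the question. In practice the HTML would come from requests.get(url).text.

```python
from html.parser import HTMLParser


class ResultProbe(HTMLParser):
    """Collects <h3> text and notes whether the 'WaitPane' placeholder exists."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.has_wait_pane = False
        self._in_h3 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self.headings.append("")
        elif dict(attrs).get("id") == "WaitPane":
            self.has_wait_pane = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        # Only accumulate text that appears inside an open <h3>
        if self._in_h3:
            self.headings[-1] += data


def probe(html):
    """Return (list of h3 texts, True if the loading placeholder is present)."""
    parser = ResultProbe()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headings], parser.has_wait_pane


# Shortened sample of the HTML printed in the question
raw = ('<div id="StudySearchResults"><div id="WaitPane" class="WaitPane">'
       '<span>Loading search results...</span></div></div>')
print(probe(raw))  # ([], True): no headings, only the placeholder
```

If the probe reports no headings but does find the placeholder, the results are being filled in later by JavaScript, and no amount of parsing will recover them from this response.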

If you want to render the page as it is shown in a browser, you will need to use something like Selenium WebDriver, which starts an actual browser and controls it remotely.

WebDriver is very powerful, but it also has a much steeper learning curve than plain web scraping.

If you want to get into using WebDriver with Python, the official documentation is a good place to start.




Answer 2:


If you want only the text, you should do this:

lista.append(h3.get_text())

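Regarding your second question, jsfan's answer is right. You should try Selenium and use its explicit wait feature to wait for your search results, which appear in divs with the class names Result master premium:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is an already started WebDriver instance that has loaded the search URL
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'Result master premium')]"))
)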
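Putting the pieces from both answers together, here is a hedged end-to-end sketch: load the page in a browser through Selenium, wait for a result div, then hand the rendered HTML to Beautiful Soup. The helper names (result_xpath, fetch_rendered_html, extract_titles) and the choice of Firefox are illustrative. The assumption that the titles sit in h3 elements inside the StudySearchResults div comes from the question's code and has not been checked against the live site. Selenium and bs4 are imported inside the functions, so both must be installed before those functions are called.

```python
def result_xpath(class_value):
    """Build an XPath matching any div whose class attribute contains class_value."""
    return "//div[contains(@class, '{}')]".format(class_value)


def fetch_rendered_html(url, timeout=10):
    """Open url in a browser, wait for a result div, return the rendered HTML."""
    # Requires Selenium plus a locally available browser and matching driver.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox is available locally
    try:
        driver.get(url)
        # Block until a result div is in the DOM; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.XPATH, result_xpath("Result master premium"))
            )
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout


def extract_titles(html):
    """Return the text of every <h3> inside the results container."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", attrs={"id": "StudySearchResults"})
    if table is None:
        return []
    return [h3.get_text(strip=True) for h3 in table.find_all("h3")]
```

With a search URL such as the one in the question, usage would look like titles = extract_titles(fetch_rendered_html(url)). Keeping the browser step and the parsing step in separate functions means the parsing can be tried on saved HTML without starting a browser.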


Source: https://stackoverflow.com/questions/35535039/beautifulsoup-scraping-loading-div-instead-of-the-content
