Question
Noob here. I'm trying to scrape search results from this website: http://www.mastersportal.eu/search/?q=di-4|lv-master&order=relevance
I'm using Python's BeautifulSoup:
    import csv
    import requests
    from bs4 import BeautifulSoup

    for numb in ('0', '69'):
        url = ('http://www.mastersportal.eu/search/?q=ci-30,11,10,3,4,8,9,14,15,16,17,34,1,19|di-4|lv-master|rv-1&start=' + numb + '0&order=tuition_eea&direction=asc')
        response = requests.get(url)
        html = response.content
        soup = BeautifulSoup(html, 'html.parser')
        # The search results are expected inside this container
        table = soup.find('div', attrs={'id': 'StudySearchResults'})
        lista = []
        for h3 in table.find_all('h3'):
            lista.append(h3.string)
        print(table.prettify())
I want to get clean data with the basic information about each Master's programme (for now just the name). The URL I'm using here is for a filtered search on the website, and the loop that steps through the pages should be fine.
However, the results are:
    <div id="StudySearchResults">
      <div style="display:none" id="TrackingSearchValue" class="TrackingSearchValue" data-search=""></div>
      <div style="display:none" id="SearchViewEvent" class="TrackingEvent TrackingNoLocation" data-type="srch" data-action="view" data-id=""></div>
      <div id="StudySearchResultsStudies" class="TrackingLinkedList" data-start="" data-list-type="study" data-type="rslts">
        <!-- Wait pane, just here to make sure there is no white page -->
        <div id="WaitPane" class="WaitPane">
          <img src="http://www.mastersportal.eu/Modules/Results/Resources/Throbber.gif" />
          <span>Loading search results...</span>
        </div>
      </div>
    </div>
Why does only the loading div show up instead of the content? From reading around, I suspect it has to do with the way the website loads its data with JavaScript. Is there something like an AJAX request for Python, or some other way to tell the scraper to wait for the page to load?
Answer 1:
You have basically answered your own question. Beautiful Soup only parses the HTML that the server returns for a specific URL (here fetched with requests). It does not execute JavaScript, so content that the page loads afterwards never reaches it.
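One way to check this for a given page is to look for the elements you want in the raw response before reaching for anything heavier. The following is a minimal sketch using only the standard library's html.parser. TagCounter, count_tags and raw_response are names made up for this illustration, and the HTML is an abridged copy of the output shown in the question.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return the number of <tag> elements found in html."""
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


# Abridged copy of the server response shown in the question (illustrative only).
raw_response = """
<div id="StudySearchResults">
  <div id="StudySearchResultsStudies">
    <div id="WaitPane" class="WaitPane"><span>Loading search results...</span></div>
  </div>
</div>
"""

# The raw response contains the loading pane but no <h3> headings.
print(count_tags(raw_response, "h3"))
```

If the count is zero for the raw response but the elements are visible in the browser, the content is most likely being filled in by JavaScript after the initial page load.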
If you want to render the page as it is shown in a browser, you will need to use something like Selenium WebDriver, which starts up an actual browser and controls it remotely.
WebDriver is very powerful, but it has a much steeper learning curve than pure web scraping.
If you want to get into using WebDriver with Python, the official documentation is a good place to start.
Answer 2:
If you want only the text of each heading, call get_text() on it inside the loop:
    lista.append(h3.get_text())
Regarding your second question, jsfan's answer is right. You should try Selenium and use its wait feature to wait for your search results, which appear in divs with the class names Result master premium:
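    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # driver is an already started WebDriver instance; wait up to 10 seconds
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'Result master premium')]"))
    )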
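Putting the pieces together, a rough end-to-end sketch could look like the following. It assumes Selenium 4 with Firefox and geckodriver available, that result entries carry the class Result as described above, and that each title sits in an h3 inside #StudySearchResults as in the question's code. None of this has been checked against the live site, and clean_names and scrape_master_names are hypothetical helper names.

```python
def clean_names(raw_names):
    """Strip surrounding whitespace and drop empty entries."""
    return [name.strip() for name in raw_names if name and name.strip()]


def scrape_master_names(url, timeout=10):
    """Open url in a real browser, wait for the results, return their titles."""
    # Selenium is imported here so that clean_names works without it installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are set up
    try:
        driver.get(url)
        try:
            # Block until at least one result entry is present in the DOM.
            # The selector is an assumption based on the class names above.
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located(
                    (By.CSS_SELECTOR, "#StudySearchResults div.Result")
                )
            )
        except TimeoutException:
            return []  # no results appeared within the timeout
        # Assumption: titles are <h3> elements, as in the question's code.
        headings = driver.find_elements(By.CSS_SELECTOR, "#StudySearchResults h3")
        return clean_names(h.text for h in headings)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling scrape_master_names() with the search URL from the question should then return a list of programme names, or an empty list if nothing loads within the timeout. If the selectors do not match the real markup, inspect the rendered page in the browser's developer tools and adjust them.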
Source: https://stackoverflow.com/questions/35535039/beautifulsoup-scraping-loading-div-instead-of-the-content