问题
So I am trying to scrape the following webpage https://www.scoreboard.com/uk/football/england/premier-league/,
Specifically the scheduled and finished results. Thus I am trying to look for the elements with class = "stage-finished" or "stage-scheduled". However when I scrape the webpage and print out what page_soup contains, it doesn't contain these elements.
I found another SO question with an answer saying that this is because it is loaded via AJAX and I need to look at the XHR under the network tab on chrome dev tools to find the file thats loading the necessary data, however it doesn't seem to be there?
import bs4
import requests
from bs4 import BeautifulSoup as soup
import csv
import datetime
myurl = "https://www.scoreboard.com/uk/football/england/premier-league/"
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = requests.get(myurl, headers=headers)
page_soup = soup(page.content, "html.parser")
scheduled = page_soup.select(".stage-scheduled")
finished = page_soup.select(".stage-finished")
live = page_soup.select(".stage-live")
print(page_soup)
print(scheduled[0])
The above code throws an error of course as there is no content in the scheduled array.
My question is, how do I go about getting the data I'm looking for?
I copied the contents of the XHR files to a notepad and searched for stage-finished and other tags and found nothing. Am I missing something easy here?
回答1:
The page is JavaScript rendered. You need Selenium. Here is some code to start on:
from selenium import webdriver
url = 'https://www.scoreboard.com/uk/football/england/premier-league/'
driver = webdriver.Chrome()
driver.get(url)
stages = driver.find_elements_by_class_name('stage-scheduled')
driver.close()
Or you could pass driver.content in to the BeautifulSoup method. Like this:
soup = BeautifulSoup(driver.page_source, 'html.parser')
Note: You need to install a webdriver first. I installed chromedriver.
Good luck!
来源:https://stackoverflow.com/questions/52408701/beautifulsoup-cant-find-class-that-exists-on-webpage