Parsing a website with BeautifulSoup and Selenium

Submitted by 匆匆过客 on 2020-03-26 03:37:20

Question


Trying to compare avg. temperatures to actual temperatures by scraping them from: https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124

I can successfully gather the webpage's source code, but I am having trouble parsing it to get only the values for the high temps, low temps, rainfall, and the averages under the "History" tab. I can't seem to address the right class/id without the result being "None".

This is what I have so far, with the last line being an attempt to get the high temps only:

from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124"
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
data = soup.find("table", {'class': "align_right_climate_table_data_td_temperature_red"})

Answer 1:


First of all, align_right and temperature_red are two different classes - you've joined them together and added table_data_td for some reason. Also, the elements carrying these two classes are td elements, not table elements.
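To see the difference, here is a minimal sketch on a made-up HTML snippet (not the real page's markup): class attributes are multi-valued in BeautifulSoup, so find() matches any element that carries a given class, while a CSS selector like td.align_right.temperature_red requires both classes on the same td:

```python
from bs4 import BeautifulSoup

# Illustrative snippet only - the real page's structure will differ
html = '<table><tr><td class="align_right temperature_red">71</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# find() with class_ matches any td carrying that one class
cell = soup.find("td", class_="temperature_red")

# select() with a compound CSS selector requires both classes at once
cells = soup.select("td.align_right.temperature_red")
```
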

In any case, to get the climate table, it looks like you should be looking for the div element having id="climate_table":

climate_table = soup.find(id="climate_table")

Another important thing to note is that there is a potential "timing" issue here - at the moment you read browser.page_source, the climate information might not be loaded yet. This is usually handled by adding an Explicit Wait after navigating to the page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup


url = "https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124"
browser = webdriver.Chrome()

try:
    browser.get(url)

    # wait for the climate data to be loaded
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "climate_table")))

    soup = BeautifulSoup(browser.page_source, "lxml")
    climate_table = soup.find(id="climate_table")

    print(climate_table.prettify())
finally:
    browser.quit()

Note the addition of the try/finally, which safely closes the browser in case of an error - that also helps avoid "hanging" browser windows.

And, look into pandas.read_html(), which can read your climate information table into a DataFrame automatically.



Source: https://stackoverflow.com/questions/47520524/parsing-a-website-with-beautifulsoup-and-selenium
