Question
I'm trying to compare average temperatures to actual temperatures by scraping them from: https://usclimatedata.com/climate/binghamton/new-york/united-states/usny0124
I can successfully fetch the page's source code, but I'm having trouble parsing it to extract only the high temps, low temps, rainfall, and the averages under the "History" tab; I can't seem to address the right class/id without the only result being None.
This is what I have so far; the last line is an attempt to get the high temps only:
from lxml import html
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://usclimatedata.com/climate/binghamton/new-york/unitedstates/usny0124"
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "lxml")
data = soup.find("table", {'class': "align_right_climate_table_data_td_temperature_red"})
Answer 1:
First of all, these are two different classes - align_right and temperature_red - you've joined them and added that table_data_td for some reason. And the elements having these two classes are td elements, not table.
In any case, to get the climate table, it looks like you should be looking for the div element having id="climate_table":
climate_table = soup.find(id="climate_table")
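Once you have that div, the individual cells can then be pulled out by their separate classes. Here is a minimal sketch run against a stand-in HTML snippet (the snippet's structure is an assumption; the live page's markup may differ):

```python
from bs4 import BeautifulSoup

# Stand-in HTML fixture approximating the page structure (assumed,
# not copied from the real site).
html_doc = """
<div id="climate_table">
  <table>
    <tr>
      <td class="align_right temperature_red">34</td>
      <td class="align_right temperature_red">37</td>
    </tr>
  </table>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find(id=...) locates the element with that id, regardless of tag name
climate_table = soup.find(id="climate_table")

# class_ matches elements whose class list contains the given class,
# so td elements carrying both align_right and temperature_red still match
high_temps = [td.get_text() for td in climate_table.find_all("td", class_="temperature_red")]
print(high_temps)  # ['34', '37']
```

The key point is that class_="temperature_red" matches a td whose class attribute lists several classes, which is why joining the class names into one string fails.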
Another important thing to note is that there is potential for "timing" issues here - when you read driver.page_source, the climate information might not be there yet. This is usually handled by adding an Explicit Wait after navigating to the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://usclimatedata.com/climate/binghamton/new-york/unitedstates/usny0124"

browser = webdriver.Chrome()
try:
    browser.get(url)

    # wait for the climate data to be loaded
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "climate_table")))

    soup = BeautifulSoup(browser.page_source, "lxml")
    climate_table = soup.find(id="climate_table")
    print(climate_table.prettify())
finally:
    browser.quit()
Note the addition of the try/finally that safely closes the browser in case of an error - that also helps to avoid "hanging" browser windows.
And, look into pandas.read_html(), which can read your climate information table into a DataFrame auto-magically.
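As a sketch of that pandas route, using a tiny stand-in table rather than the real page (on the live site you would pass browser.page_source; note that read_html also requires an HTML parser library such as lxml to be installed):

```python
import pandas as pd
from io import StringIO

# Stand-in table; the real climate table's columns are assumptions here.
html_doc = """
<table>
  <tr><th>Month</th><th>High</th><th>Low</th></tr>
  <tr><td>Jan</td><td>28</td><td>14</td></tr>
  <tr><td>Feb</td><td>32</td><td>16</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds;
# recent pandas versions expect a file-like object for literal HTML
tables = pd.read_html(StringIO(html_doc))
df = tables[0]
print(df)
```

Each table on the page becomes one DataFrame in the returned list, with the header row turned into column labels and numeric cells parsed as numbers.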
Source: https://stackoverflow.com/questions/47520524/parsing-a-website-with-beautifulsoup-and-selenium