I am trying to scrape the historical weather data from "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mc
Here's a solution using Selenium for browser automation:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

# assumes chromedriver is on your PATH; otherwise pass its location via a Service
driver = webdriver.Chrome()
driver.implicitly_wait(30)  # wait up to 30 s for elements to appear
driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
df = pd.read_html(driver.find_element(By.ID, "history_table").get_attribute('outerHTML'))[0]
Time Temperature Dew Point Humidity Wind Speed Gust Pressure Precip. Rate. Precip. Accum. UV Solar
0 12:02 AM 25.5 °C 18.7 °C 75 % East 0 kph 0 kph 29.3 hPa 0 mm 0 mm 0 0 w/m²
1 12:07 AM 25.5 °C 19 °C 76 % East 0 kph 0 kph 29.31 hPa 0 mm 0 mm 0 0 w/m²
2 12:12 AM 25.5 °C 19 °C 76 % East 0 kph 0 kph 29.31 hPa 0 mm 0 mm 0 0 w/m²
3 12:17 AM 25.5 °C 18.7 °C 75 % East 0 kph 0 kph 29.3 hPa 0 mm 0 mm 0 0 w/m²
4 12:22 AM 25.5 °C 18.7 °C 75 % East 0 kph 0 kph 29.3 hPa 0 mm 0 mm 0 0 w/m²
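Since every cell comes back as a string with a unit suffix, a follow-up cleaning step helps before doing any numeric work. A minimal sketch on hypothetical rows shaped like the output above (the column subset and values here are illustrative):

```python
import pandas as pd

# hypothetical rows shaped like the scraped table above
df = pd.DataFrame({
    'Temperature': ['25.5 °C', '25.5 °C'],
    'Humidity': ['75 %', '76 %'],
})

# strip the unit suffixes and convert to numeric types
df['Temperature'] = df['Temperature'].str.replace(' °C', '', regex=False).astype(float)
df['Humidity'] = df['Humidity'].str.rstrip(' %').astype(int)
print(df.dtypes)
```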
Edit: here's a breakdown of exactly what's happening, since the one-liner above isn't very self-documenting:
After setting up the driver, we select the table by its ID value (thankfully this site actually uses reasonable, descriptive IDs):
tab = driver.find_element(By.ID, "history_table")
Then, from that element, we get the HTML instead of the web driver element object:
tab_html = tab.get_attribute('outerHTML')
We use pandas to parse the HTML:
tab_dfs = pd.read_html(tab_html)
From the docs:
"read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"
So we index into that list for the only table we have, at index zero:
df = tab_dfs[0]
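To make the list-indexing concrete, here is a tiny self-contained example (the HTML snippet is made up, shaped like the station table):

```python
from io import StringIO
import pandas as pd

html = """
<table id="history_table">
  <tr><th>Time</th><th>Temperature</th></tr>
  <tr><td>12:02 AM</td><td>25.5 °C</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # always a list, even for a single table
df = tables[0]                         # index into the list
print(df)
```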
You can use requests and avoid opening a browser.
You can get current conditions by using:
https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15
then strip off 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right. You can then parse the remaining JSON string.
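The unwrapping is plain string slicing. A minimal sketch with a made-up payload (the fields are illustrative); note that str.strip() removes a *set of characters*, not a prefix, so slicing by the prefix length is safer:

```python
import json

# made-up JSONP body shaped like the station feed
raw = 'jQuery1720724027235122559_1542743885014({"stations": {"KMAHADLE7": {"temperature": "25.5"}}})'

prefix = 'jQuery1720724027235122559_1542743885014('
payload = raw[len(prefix):].rstrip(')')  # slice off the callback wrapper
data = json.loads(payload)
print(data['stations']['KMAHADLE7']['temperature'])
```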
You can get summary and history by calling the API with the following
https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276
You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the end. You then have a JSON string you can parse.
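Since the two endpoints differ only in the callback name, a small helper can unwrap either response (a sketch; it assumes the callback name itself contains no parentheses):

```python
import json

def unwrap_jsonp(body: str) -> dict:
    """Strip a JSONP wrapper like 'callback(...);' and parse the JSON inside."""
    start = body.index('(') + 1   # first '(' ends the callback name
    end = body.rindex(')')        # last ')' closes the call
    return json.loads(body[start:end])

# made-up, trimmed-down history payload for illustration
sample = 'jQuery1720724027235122559_1542743885015({"history": {"days": []}});'
print(unwrap_jsonp(sample))
```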
You can find these URLs by using F12 dev tools in browser and inspecting the network tab for the traffic created during page load.
An example for current conditions, noting there seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":
import json

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
raw = soup.select_one('html').text.strip()

# slice off the JSONP wrapper; str.strip(prefix) would remove a *set of
# characters* rather than the prefix itself, so slice by length instead
prefix = 'jQuery1720724027235122559_1542743885014('
s = raw[len(prefix):].rstrip(')')
s = s.replace('null', '"placeholder"')  # work around the nulls in the feed

data = json.loads(s)
df = pd.json_normalize(data)  # json_normalize already returns a DataFrame
print(df)