pandas read_html ValueError: No tables found

暗喜 2020-12-03 23:33

I am trying to scrape the historical weather data from "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html" with pandas.read_html, but it raises ValueError: No tables found.
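
A minimal reproduction (assuming the call is made directly on the page URL):

    import pandas as pd

    url = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
    dfs = pd.read_html(url)  # ValueError: No tables found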

2 Answers
  • 2020-12-04 00:11

    Here's a solution using Selenium for browser automation. The history table is rendered by JavaScript after the page loads, so the plain HTML that pd.read_html downloads contains no tables; driving a real browser lets the table render first:

    from selenium import webdriver
    import pandas as pd

    # point this at your chromedriver binary (path is system-specific)
    driver = webdriver.Chrome('/path/to/chromedriver')
    # wait up to 30 s for elements to appear, since the table is built by JavaScript
    driver.implicitly_wait(30)

    driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
    df = pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]
    
               Time  Temperature  Dew Point  Humidity  Wind  Speed   Gust   Pressure  Precip. Rate.  Precip. Accum.  UV   Solar
    0  12:02 AM      25.5 °C    18.7 °C      75 %  East  0 kph  0 kph   29.3 hPa           0 mm            0 mm   0  0 w/m²
    1  12:07 AM      25.5 °C      19 °C      76 %  East  0 kph  0 kph  29.31 hPa           0 mm            0 mm   0  0 w/m²
    2  12:12 AM      25.5 °C      19 °C      76 %  East  0 kph  0 kph  29.31 hPa           0 mm            0 mm   0  0 w/m²
    3  12:17 AM      25.5 °C    18.7 °C      75 %  East  0 kph  0 kph   29.3 hPa           0 mm            0 mm   0  0 w/m²
    4  12:22 AM      25.5 °C    18.7 °C      75 %  East  0 kph  0 kph   29.3 hPa           0 mm            0 mm   0  0 w/m²
    

    Edit: here's a breakdown of exactly what's happening, since the one-liner above is not very self-documenting:

    After setting up the driver, we select the table by its ID value (thankfully, this site actually uses reasonable and descriptive IDs):

    tab = driver.find_element_by_id("history_table")
    

    Then, from that element, we get the HTML instead of the web driver element object

    tab_html = tab.get_attribute('outerHTML')
    

    We use pandas to parse the HTML:

    tab_dfs = pd.read_html(tab_html)
    

    From the docs:

    "read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"

    So we index into that list with the only table we have, at index zero

    df = tab_dfs[0]
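
    Finally, shut the browser down when you're done; persisting the frame is optional (the CSV filename below is just an illustration):

    df.to_csv('history_table.csv', index=False)  # hypothetical output file
    driver.quit()  # close the browser session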
    
  • 2020-12-04 00:30

    You can use requests and avoid opening a browser.

    You can get the current conditions by using:

    https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15

    and strip 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right, then handle the JSON string.

    You can get the summary and history by calling the API with the following:

    https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276

    You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the end, which leaves a JSON string you can parse.
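
    As a sketch of that history call (the API key and callback values are copied verbatim from the URL above and may have expired; inspect the parsed structure before relying on any particular keys):

    import requests
    import json

    url = ('https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/'
           'units:both/v:2.0/q/pws:KMAHADLE7.json'
           '?callback=jQuery1720724027235122559_1542743885015&_=1542743886276')
    res = requests.get(url)
    text = res.text
    # drop the JSONP wrapper: keep everything between the first '(' and the last ')'
    s = text[text.find('(') + 1:text.rfind(')')]
    data = json.loads(s)
    print(json.dumps(data, indent=2)[:500])  # inspect the structure first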

    You can find these URLs by using the F12 dev tools in your browser and inspecting the network tab for the traffic created during page load.

    An example for current conditions, noting there seems to be a problem with nulls in the JSON, so I replace them with "placeholder":

    import requests
    import json
    import pandas as pd

    url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
    res = requests.get(url)
    # The response is JSONP: JSON wrapped in jQuery...( ... ). Slice between the
    # first '(' and the last ')' rather than using str.strip, which treats its
    # argument as a character set and can eat parts of the JSON itself.
    text = res.text
    s = text[text.find('(') + 1:text.rfind(')')]
    s = s.replace('null', '"placeholder"')
    data = json.loads(s)
    df = pd.json_normalize(data)  # pandas >= 1.0; older versions: pandas.io.json.json_normalize
    print(df)
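
    Since json_normalize flattens nested objects into dotted column names that depend on the station's feed, it is worth listing the columns before relying on any particular one:

    print(df.columns.tolist())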
    