pandas read_html ValueError: No tables found

暗喜 2020-12-03 23:33

I am trying to scrape the historical weather data from "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html" with pandas.read_html, but it raises ValueError: No tables found.
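
A minimal reproduction (assuming the call is made directly on the page URL):

    import pandas as pd

    url = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
    dfs = pd.read_html(url)  # ValueError: No tables found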

2 Answers
  • 2020-12-04 00:11

    Here's a solution using Selenium for browser automation. The history table is rendered by JavaScript after the page loads, so the plain HTML that pd.read_html downloads contains no tables; driving a real browser lets the table render first:

    from selenium import webdriver
    import pandas as pd

    # point this at your chromedriver binary (path is system-specific)
    driver = webdriver.Chrome('/path/to/chromedriver')
    # wait up to 30 s for elements to appear, since the table is built by JavaScript
    driver.implicitly_wait(30)

    driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
    df = pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]
    
               Time  Temperature  Dew Point  Humidity  Wind  Speed   Gust   Pressure  Precip. Rate.  Precip. Accum.  UV   Solar
    0  12:02 AM      25.5 °C    18.7 °C      75 %  East  0 kph  0 kph   29.3 hPa           0 mm            0 mm   0  0 w/m²
    1  12:07 AM      25.5 °C      19 °C      76 %  East  0 kph  0 kph  29.31 hPa           0 mm            0 mm   0  0 w/m²
    2  12:12 AM      25.5 °C      19 °C      76 %  East  0 kph  0 kph  29.31 hPa           0 mm            0 mm   0  0 w/m²
    3  12:17 AM      25.5 °C    18.7 °C      75 %  East  0 kph  0 kph   29.3 hPa           0 mm            0 mm   0  0 w/m²
    4  12:22 AM      25.5 °C    18.7 °C      75 %  East  0 kph  0 kph   29.3 hPa           0 mm            0 mm   0  0 w/m²
    

    Edit: here's a breakdown of exactly what's happening, since the one-liner above is not very self-documenting:

    After setting up the driver, we select the table by its ID value (thankfully, this site actually uses reasonable and descriptive IDs):

    tab = driver.find_element_by_id("history_table")
    

    Then, from that element, we get the HTML instead of the web driver element object

    tab_html = tab.get_attribute('outerHTML')
    

    We use pandas to parse the HTML:

    tab_dfs = pd.read_html(tab_html)
    

    From the docs:

    "read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"

    So we index into that list with the only table we have, at index zero

    df = tab_dfs[0]
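
    Finally, shut the browser down when you're done; persisting the frame is optional (the CSV filename below is just an illustration):

    df.to_csv('history_table.csv', index=False)  # hypothetical output file
    driver.quit()  # close the browser session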
    
  • 2020-12-04 00:30

    You can use requests and avoid opening a browser.

    You can get the current conditions by using:

    https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15

    and strip 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right, then handle the JSON string.

    You can get the summary and history by calling the API with the following:

    https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276

    You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the end, which leaves a JSON string you can parse.
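
    As a sketch of that history call (the API key and callback values are copied verbatim from the URL above and may have expired; inspect the parsed structure before relying on any particular keys):

    import requests
    import json

    url = ('https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/'
           'units:both/v:2.0/q/pws:KMAHADLE7.json'
           '?callback=jQuery1720724027235122559_1542743885015&_=1542743886276')
    res = requests.get(url)
    text = res.text
    # drop the JSONP wrapper: keep everything between the first '(' and the last ')'
    s = text[text.find('(') + 1:text.rfind(')')]
    data = json.loads(s)
    print(json.dumps(data, indent=2)[:500])  # inspect the structure first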

    You can find these URLs by using the F12 dev tools in your browser and inspecting the network tab for the traffic created during page load.

    An example for current conditions, noting there seems to be a problem with nulls in the JSON, so I replace them with "placeholder":

    import requests
    import json
    import pandas as pd

    url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
    res = requests.get(url)
    # The response is JSONP: JSON wrapped in jQuery...( ... ). Slice between the
    # first '(' and the last ')' rather than using str.strip, which treats its
    # argument as a character set and can eat parts of the JSON itself.
    text = res.text
    s = text[text.find('(') + 1:text.rfind(')')]
    s = s.replace('null', '"placeholder"')
    data = json.loads(s)
    df = pd.json_normalize(data)  # pandas >= 1.0; older versions: pandas.io.json.json_normalize
    print(df)
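
    Since json_normalize flattens nested objects into dotted column names that depend on the station's feed, it is worth listing the columns before relying on any particular one:

    print(df.columns.tolist())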
    