pandas read_html ValueError: No tables found

暗喜 2020-12-03 23:33

I am trying to scrape the historical weather data from "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mc

2 Answers
  •  余生分开走
    2020-12-04 00:30

    You can use requests and avoid opening a browser.

    You can get current conditions by using:

    https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15

    and strip off 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right. Then parse the JSON string.

    You can get the summary and history by calling the API with the following URL:

    https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276

    You then need to strip 'jQuery1720724027235122559_1542743885015(' from the left and ');' from the right. What remains is a JSON string you can parse.
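
    The stripping described above can be sketched as a small helper. This is a minimal sketch, not the only way to do it: the name unwrap_jsonp and the two sample payloads are hypothetical and are not real responses from the service. It uses a regular expression instead of str.strip, because str.strip removes any of the given characters from both ends rather than a literal prefix or suffix.

```python
import json
import re

# Matches "callbackName( ... )" with an optional trailing semicolon.
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Parse a JSONP body; plain JSON is passed through unchanged."""
    match = _JSONP_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Hypothetical payloads, only shaped like the two responses described above.
sample_current = ('jQuery1720724027235122559_1542743885014('
                  '{"stations": {"KMAHADLE7": {"temperature": null}}})')
sample_history = ('jQuery1720724027235122559_1542743885015('
                  '{"history": {"days": []}});')

print(unwrap_jsonp(sample_current))
print(unwrap_jsonp(sample_history))
```

    With a real response you would pass res.text from requests.get(url) to the helper instead of the sample strings.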

    Sample of JSON:

    You can find these URLs by opening the F12 dev tools in the browser and inspecting the network tab for the traffic created during page load.

    An example for current conditions. There seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":

    import json
    import requests
    import pandas as pd

    url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
    res = requests.get(url)
    # The response is JSONP, i.e. callback(...): keep only the text between
    # the first '(' and the last ')'. str.strip() is not used here because it
    # removes a set of characters, not a literal prefix.
    s = res.text.strip()
    s = s[s.index('(') + 1:s.rindex(')')]
    # Replace JSON nulls with a placeholder string, as noted above.
    s = s.replace('null', '"placeholder"')
    data = json.loads(s)
    # Flatten the nested JSON into a one-row DataFrame (pandas >= 1.0).
    df = pd.json_normalize(data)
    print(df)
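
    A possible alternative to replacing 'null' in the raw text is to parse first and substitute afterwards, since json.loads already maps JSON null to Python None. The sketch below is an assumption-laden illustration: fill_nulls is a hypothetical helper and the raw string is made-up sample data, not output from the station API. It avoids touching string values that happen to contain the letters "null".

```python
import json


def fill_nulls(value, placeholder="placeholder"):
    """Recursively replace None with a placeholder in parsed JSON data."""
    if value is None:
        return placeholder
    if isinstance(value, dict):
        return {key: fill_nulls(item, placeholder) for key, item in value.items()}
    if isinstance(value, list):
        return [fill_nulls(item, placeholder) for item in value]
    return value


# Made-up sample: note the string value that contains "null".
raw = '{"station": "nullable-name", "temperature": null, "readings": [1.5, null]}'

cleaned = fill_nulls(json.loads(raw))
print(cleaned)
```

    The cleaned dictionary can then be passed to pd.json_normalize in the same way as in the example above.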
    
