Get web page content (Not from source code) [duplicate]

Posted by 别来无恙 on 2021-02-08 03:54:21

Question


I want to get the rainfall data of each day from here.

When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it.

I am using urllib2 and BeautifulSoup from bs4.

Here is my code:

import urllib2
from bs4 import BeautifulSoup

link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"

r = urllib2.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one
# print soup.find_all("div", class_="dataTable")

Both calls return an empty list.

My question is: How can I get the page content, but not from the page source code?


Answer 1:


If you cannot find the div in the page source, it means the div you are looking for is generated dynamically by JavaScript after the page loads. The page could be using a JS framework like Angular, or just jQuery. If you want to browse the rendered HTML, you have to use a browser that runs the JS code included in the page.
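One way to confirm this without a browser is to check whether the target class appears anywhere in the raw HTML that the server returns. The sketch below is written for Python 3 and uses only the standard library; `ClassFinder` and `count_class` are hypothetical helper names introduced for illustration, not part of any library.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be missing or empty, so fall back to "".
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_class(html, tag, css_class):
    """Return how many <tag> elements in html have css_class."""
    finder = ClassFinder(tag, css_class)
    finder.feed(html)
    finder.close()
    return finder.count


if __name__ == "__main__":
    from urllib.request import urlopen

    url = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"
    raw = urlopen(url).read().decode("utf-8", errors="replace")
    print(count_class(raw, "td", "td1_normal_class"))
```

If the count is zero for the raw HTML while the browser inspector shows matching cells, the cells are most likely added by JavaScript after the page loads.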

Try using Selenium. See the related question:

How can I parse a website using Selenium and Beautifulsoup in python?

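from bs4 import BeautifulSoup
from selenium import webdriver

# Launch a real browser so the page's JavaScript runs
driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')

# Grab the HTML as the browser currently sees it, then close the browser
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("td", class_="td1_normal_class"))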

However, note that using Selenium slows the process down considerably, since it has to launch a real browser.




Answer 2:


If you open the dev tools in Chrome or Firefox and look at the network requests, you'll see that the data comes from a request to http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml, which returns the data for all 12 months. You can then extract what you need from that response.
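A minimal sketch of that approach in Python 3, using only the standard library. The format of the response is not described in this answer; the file extension suggests XML, but the sketch does not rely on that. `parse_payload` is a hypothetical helper that tries JSON first and falls back to XML, so you can inspect what comes back before writing a targeted extraction.

```python
import json
import xml.etree.ElementTree as ET
from urllib.request import urlopen

DATA_URL = "http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml"


def parse_payload(text):
    """Parse text as JSON if possible, otherwise as XML.

    Returns ("json", obj) or ("xml", root_element).
    """
    try:
        return "json", json.loads(text)
    except ValueError:
        # Not valid JSON, so assume XML (raises ParseError if it is neither).
        return "xml", ET.fromstring(text)


if __name__ == "__main__":
    # utf-8-sig drops a leading byte-order mark if the server sends one.
    raw = urlopen(DATA_URL).read().decode("utf-8-sig", errors="replace")
    kind, parsed = parse_payload(raw)
    print(kind)
    print(raw[:300])  # peek at the structure before extracting fields
```

Requesting the data file directly avoids launching a browser, but the extraction code depends on a structure you have to discover by inspecting the response.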



Source: https://stackoverflow.com/questions/39554955/get-web-page-content-not-from-source-code
