Downloading a web page and all of its resource files in Python

Submitted by 南楼画角 on 2019-11-30 13:28:10

Question


I want to be able to download a page and all of its associated resources (images, style sheets, script files, etc.) using Python. I am (somewhat) familiar with urllib2 and know how to download individual URLs, but before I start hacking at BeautifulSoup + urllib2 I wanted to be sure there wasn't already a Python equivalent of "wget --page-requisites http://www.google.com".

Specifically I am interested in gathering statistical information about how long it takes to download an entire web page, including all resources.

Thanks, Mark
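(Editor's note: the BeautifulSoup + urllib2 approach mentioned in the question can be sketched with modern stdlib equivalents — urllib.request and html.parser — so it runs on Python 3 with no third-party dependencies. This is a minimal sketch, not a full page-requisites crawler: it only looks at img/script src and stylesheet link href, and real pages need more robust error and encoding handling.)

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ResourceCollector(HTMLParser):
    """Collect URLs of page requisites: images, stylesheets, scripts."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.resources.append(urljoin(self.base_url, attrs["href"]))


def timed_fetch(url):
    """Download one URL, returning (bytes, seconds elapsed)."""
    start = time.monotonic()
    data = urlopen(url).read()
    return data, time.monotonic() - start


def download_page_with_requisites(url):
    """Fetch a page plus its resources; return (total seconds, per-URL timings)."""
    timings = {}
    html, timings[url] = timed_fetch(url)
    parser = ResourceCollector(url)
    parser.feed(html.decode("utf-8", errors="replace"))
    for res in parser.resources:
        try:
            _, timings[res] = timed_fetch(res)
        except OSError:
            timings[res] = None  # record failures instead of aborting
    total = sum(t for t in timings.values() if t is not None)
    return total, timings
```

The per-URL timings dictionary gives exactly the statistic the question asks about: how long the whole page, including all resources, takes to download.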


Answer 1:


Websucker? See http://effbot.org/zone/websucker.htm




Answer 2:


websucker.py doesn't follow CSS links. HTTrack (httrack.com) is not Python, it's C/C++, but it's a good, maintained utility for downloading a website for offline browsing.

http://www.mail-archive.com/python-bugs-list@python.org/msg13523.html [issue1124] Webchecker not parsing css "@import url"

Guido> This is essentially unsupported and unmaintained example code. Feel free to submit a patch though!
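(Editor's note: the gap this answer points out — webchecker/websucker not parsing CSS "@import url" rules — could be patched over with a small regex pass on downloaded stylesheets. A hedged sketch; a real CSS parser would be more robust than a regex:)

```python
import re

# Match the common @import forms in CSS:
#   @import url("a.css");   @import url(a.css);   @import "a.css";
IMPORT_RE = re.compile(
    r'@import\s+(?:url\(\s*)?["\']?([^"\')\s;]+)["\']?\s*\)?',
    re.IGNORECASE,
)


def css_imports(css_text):
    """Return the URLs referenced by @import rules in a stylesheet."""
    return IMPORT_RE.findall(css_text)
```

Feeding each downloaded stylesheet through css_imports and recursing on the results would cover the case the linked bug report describes.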



Source: https://stackoverflow.com/questions/844115/downloading-a-web-page-and-all-of-its-resource-files-in-python
