How to download a full webpage with a Python script?

星月不相逢 2020-12-09 18:36

Currently I have a script that can only download the HTML of a given page.

Now I want to download all the other files of the page as well (CSS, JavaScript, images).

4 Answers
  •  庸人自扰
    2020-12-09 19:10

    Using Python 3 with Requests and BeautifulSoup; the remaining imports are from the standard library.

    The function savePage receives a requests.Response and a pagefilename base name under which to save it.

    • Saves pagefilename.html in the current folder.
    • Downloads JavaScript, CSS and images based on the script, link and img tags, saving them in a folder pagefilename_files.
    • Any exceptions are printed to sys.stderr; the function returns a BeautifulSoup object.
    • The requests session must be a global variable (unless someone writes a cleaner version here for us).

    You can adapt it to your needs.
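The link rewriting below relies on urllib.parse.urljoin to resolve relative src/href values against the page URL. A quick stdlib-only sketch of how it behaves (the URLs here are made up for illustration):

```python
from urllib.parse import urljoin

base = 'https://www.example.com/docs/page.html'  # hypothetical page URL

print(urljoin(base, 'style.css'))      # resolved against the page's folder:
                                       # https://www.example.com/docs/style.css
print(urljoin(base, '/img/logo.png'))  # resolved against the site root:
                                       # https://www.example.com/img/logo.png
print(urljoin(base, 'https://cdn.example.com/app.js'))  # already absolute:
                                       # passes through unchanged
```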


    import os, sys
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def soupfindAllnSave(pagefolder, url, soup, tag2find='img', inner='src'):
        if not os.path.exists(pagefolder):  # create only once
            os.mkdir(pagefolder)
        for res in soup.find_all(tag2find):  # images, css, etc.
            if not res.get(inner):  # tag has no src/href attribute
                continue
            try:
                filename = os.path.basename(res[inner])
                fileurl = urljoin(url, res[inner])
                filepath = os.path.join(pagefolder, filename)
                # rewrite the tag to point at the local copy
                res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                if not os.path.isfile(filepath):  # not downloaded yet
                    filebin = session.get(fileurl)  # session is a global
                    with open(filepath, 'wb') as file:
                        file.write(filebin.content)
            except Exception as exc:
                print(exc, file=sys.stderr)
        return soup

    def savePage(response, pagefilename='page'):
        url = response.url
        soup = BeautifulSoup(response.text, 'html.parser')
        pagefolder = pagefilename + '_files'  # page contents
        soup = soupfindAllnSave(pagefolder, url, soup, 'img', inner='src')
        soup = soupfindAllnSave(pagefolder, url, soup, 'link', inner='href')
        soup = soupfindAllnSave(pagefolder, url, soup, 'script', inner='src')
        with open(pagefilename + '.html', 'w', encoding='utf-8') as file:
            file.write(soup.prettify())
        return soup

    Example: saving the Google page and its contents (into a google_files folder).

    session = requests.Session()
    #... whatever requests config you need here
    response = session.get('https://www.google.com')
    savePage(response, 'google')
    
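One caveat you may want to adapt: os.path.basename keeps any query string, so a URL like app.js?v=123 produces an odd filename. A small sketch that strips the query first (url_to_filename is a hypothetical helper, not part of the code above):

```python
import os
from urllib.parse import urlparse

def url_to_filename(fileurl):
    # keep only the URL path, dropping any ?query and #fragment
    path = urlparse(fileurl).path
    return os.path.basename(path)

print(url_to_filename('https://cdn.example.com/js/app.js?v=123'))  # app.js
```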
