How to download a full webpage with a Python script?

星月不相逢 2020-12-09 18:36

Currently I have a script that can only download the HTML of a given page.

Now I want to download all the other files of the page as well (CSS, JavaScript, images).

4 Answers
  •  庸人自扰
    2020-12-09 19:10

    Using Python 3 with Requests and BeautifulSoup; the remaining imports are from the standard library.

    The function savePage receives a requests.Response and a pagefilename base name under which to save it.

    • Saves pagefilename.html in the current folder.
    • Downloads JavaScript, CSS and images based on the script, link and img tags, saving them in a folder pagefilename_files.
    • Any exceptions are printed to sys.stderr; the function returns a BeautifulSoup object.
    • The requests session must be a global variable (unless someone writes a cleaner version here for us).

    You can adapt it to your needs.
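The link rewriting below relies on urllib.parse.urljoin to resolve relative src/href values against the page URL. A quick stdlib-only sketch of how it behaves (the URLs here are made up for illustration):

```python
from urllib.parse import urljoin

base = 'https://www.example.com/docs/page.html'  # hypothetical page URL

print(urljoin(base, 'style.css'))      # resolved against the page's folder:
                                       # https://www.example.com/docs/style.css
print(urljoin(base, '/img/logo.png'))  # resolved against the site root:
                                       # https://www.example.com/img/logo.png
print(urljoin(base, 'https://cdn.example.com/app.js'))  # already absolute:
                                       # passes through unchanged
```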


    import os, sys
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def soupfindAllnSave(pagefolder, url, soup, tag2find='img', inner='src'):
        if not os.path.exists(pagefolder):  # create only once
            os.mkdir(pagefolder)
        for res in soup.find_all(tag2find):  # images, css, etc.
            if not res.get(inner):  # tag has no src/href attribute
                continue
            try:
                filename = os.path.basename(res[inner])
                fileurl = urljoin(url, res[inner])
                filepath = os.path.join(pagefolder, filename)
                # rewrite the tag to point at the local copy
                res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                if not os.path.isfile(filepath):  # not downloaded yet
                    filebin = session.get(fileurl)  # session is a global
                    with open(filepath, 'wb') as file:
                        file.write(filebin.content)
            except Exception as exc:
                print(exc, file=sys.stderr)
        return soup

    def savePage(response, pagefilename='page'):
        url = response.url
        soup = BeautifulSoup(response.text, 'html.parser')
        pagefolder = pagefilename + '_files'  # page contents
        soup = soupfindAllnSave(pagefolder, url, soup, 'img', inner='src')
        soup = soupfindAllnSave(pagefolder, url, soup, 'link', inner='href')
        soup = soupfindAllnSave(pagefolder, url, soup, 'script', inner='src')
        with open(pagefilename + '.html', 'w', encoding='utf-8') as file:
            file.write(soup.prettify())
        return soup

    Example: saving the Google page and its contents (into a google_files folder).

    session = requests.Session()
    #... whatever requests config you need here
    response = session.get('https://www.google.com')
    savePage(response, 'google')
    
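One caveat you may want to adapt: os.path.basename keeps any query string, so a URL like app.js?v=123 produces an odd filename. A small sketch that strips the query first (url_to_filename is a hypothetical helper, not part of the code above):

```python
import os
from urllib.parse import urlparse

def url_to_filename(fileurl):
    # keep only the URL path, dropping any ?query and #fragment
    path = urlparse(fileurl).path
    return os.path.basename(path)

print(url_to_filename('https://cdn.example.com/js/app.js?v=123'))  # app.js
```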
