Save complete web page (incl css, images) using python/selenium

一个人的身影 2020-12-14 17:43

I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the res

4 Answers
  •  不知归路
    2020-12-14 18:34

    This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the HTML and downloading every file it loads (images, CSS, JS, etc.) to the same relative path under the saved page.

    Most of the JavaScript won't work because of cross-origin request blocking, but the content will look (mostly) the same.

    This uses requests to download the referenced files, lxml to parse the HTML, and os for the path legwork.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import chromedriver_binary  # adds the bundled chromedriver to PATH
    from lxml import html
    import requests
    import os
    
    driver = webdriver.Chrome()
    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
    base = 'https://blast.ncbi.nlm.nih.gov/'
    
    # open the BLAST form, paste the sequence, and submit the search
    # (find_element(By.ID, ...) replaces find_element_by_id, which newer Selenium releases removed)
    driver.get(URL)
    seq_query_field = driver.find_element(By.ID, "seq")
    seq_query_field.send_keys(SEQUENCE)
    blast_button = driver.find_element(By.ID, "b1")
    blast_button.click()
    
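    # page_source is the DOM as it stands right now; if the search is still
    # running, wait for the results page before reading it
    content = driver.page_source
    # write the page content (exist_ok lets the script be re-run)
    os.makedirs('page', exist_ok=True)
    with open('page/page.html', 'w', encoding='utf-8') as fp:
        fp.write(content)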
    
    # download the referenced files to the same path as in the html
    sess = requests.Session()
    sess.get(base)            # sets cookies
    
    # parse html
    h = html.fromstring(content)
    # get CSS and other files linked by href in the head.
    # skip empty links and anything hosted elsewhere, so local_path is always defined
    for hr in h.xpath('head//@href'):
        if not hr or hr.startswith(('http', '//')):
            continue
        local_path = 'page/' + hr.lstrip('/')
        res = sess.get(base + hr.lstrip('/'))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, 'wb') as fp:
            fp.write(res.content)
    
    # get images and scripts referenced by src.  skip anything loaded from outside sources
    for src in h.xpath('//@src'):
        if not src or src.startswith(('http', '//')):
            continue
        local_path = 'page/' + src.lstrip('/')
        print(local_path)
        src = base + src.lstrip('/')
        res = sess.get(src)   # fetch this src, not the last href from the loop above
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, 'wb') as fp:
            fp.write(res.content)
    

    You should end up with a folder called page that contains page.html, with the content you are after, alongside the downloaded files.
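The path handling in the two download loops is the fragile part, so here is a minimal sketch of one way to isolate it, assuming you only want files served from the same host as the page. The helper name `resolve_asset` and its `root` parameter are hypothetical, not part of the answer's code; it swaps the string concatenation for the standard library's `urljoin` and drops the query string from the local file name.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_asset(page_url, ref, root="page"):
    """Map an href/src value to (absolute_url, local_path).

    Returns None for references that should be skipped: empty values,
    non-HTTP schemes such as data: URIs, other hosts, and directory-like paths.
    """
    if not ref:
        return None
    # urljoin handles relative, root-relative and protocol-relative references
    absolute = urljoin(page_url, ref)
    target = urlparse(absolute)
    if target.scheme not in ("http", "https"):
        return None
    if target.netloc != urlparse(page_url).netloc:
        return None  # loaded from an outside source
    if target.path.endswith("/"):
        return None  # no file name to save under
    # collapse "." and ".." segments and drop the leading slash
    rel = posixpath.normpath(target.path).lstrip("/")
    if not rel or rel == ".":
        return None
    return absolute, posixpath.join(root, rel)


PAGE = "https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx"
print(resolve_asset(PAGE, "css/main.css?v=2"))
print(resolve_asset(PAGE, "//cdn.example.com/lib.js"))
```

Each loop could then call the helper, `continue` when it returns None, and pass the first element to `sess.get` and the second to `open`. Note that root-relative references such as `/images/logo.png` are saved under `page/` here, but the saved HTML would still need those links rewritten before a browser opening `page.html` from disk could find them.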
