Save complete web page (incl css, images) using python/selenium

一个人的身影 2020-12-14 17:43

I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the res

4 Answers
  • 2020-12-14 18:23

    I'd suggest trying SikuliX, an image-based automation tool that can operate any on-screen widget in a desktop OS. It supports Python syntax, runs from the command line, and may be the simplest way to solve your problem. All you need to do is give it a screenshot and call the SikuliX script from your Python automation script (with os.system("xxxx") or subprocess...).
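
    As a rough sketch of the subprocess route mentioned above: the helper name run_script is made up for illustration, and the SikuliX command line in the comment (jar name, -r flag, script name) is an assumption to check against your SikuliX version, not a confirmed invocation.

```python
import subprocess
import sys


def run_script(command, timeout=300):
    """Run an external command and return (exit_code, stdout)."""
    result = subprocess.run(
        command,
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output to str
        timeout=timeout,      # give up if the tool hangs
    )
    return result.returncode, result.stdout


# Assumed SikuliX invocation (placeholder paths, unverified flag):
#   run_script(["java", "-jar", "sikulix.jar", "-r", "save_page.sikuli"])

# Stand-in command so the sketch runs without SikuliX installed.
code, out = run_script([sys.executable, "-c", "print('saved')"])
print(code, out.strip())
```

    Passing the command as a list avoids shell quoting problems, which is the main reason to prefer subprocess over os.system here.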

  • 2020-12-14 18:24

    Inspired by FThompson's answer above, I came up with the following tool that can download full/complete html for a given page url (see: https://github.com/markfront/SinglePageFullHtml)

    UPDATE: following Max's suggestion, here are the steps to use the tool:

    1. Clone the project, then build it with Maven:
    $> git clone https://github.com/markfront/SinglePageFullHtml.git
    
    $> cd SinglePageFullHtml
    $> mvn clean compile package
    
    2. Find the generated jar file in the target folder: SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar

    3. Run the jar from the command line:

    $> java -jar ./target/SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar <page_url>
    
    4. The result file name has the prefix "FP", followed by the hash code of the page URL, with the file extension ".html". It should be in the folder "/tmp" (which you can get with System.getProperty("java.io.tmpdir") in Java). If not, look in your home directory (System.getProperty("user.home") in Java).

    5. The result file is one large self-contained HTML file that includes everything (CSS, JavaScript, images, etc.) referred to by the original HTML source.
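
    Since the question is about a Python script, here is a minimal sketch of driving the jar from Python and then looking for the output. The function names are hypothetical, the jar path assumes you run from the project directory, and the FP*.html pattern is only inferred from the naming described above.

```python
import glob
import os
import subprocess
import tempfile

# Assumption: path relative to the cloned project directory.
JAR = "target/SinglePageFullHtml-1.0-SNAPSHOT-jar-with-dependencies.jar"


def find_result_files(directory):
    """Return files matching FP*.html in `directory`, newest first."""
    matches = glob.glob(os.path.join(directory, "FP*.html"))
    return sorted(matches, key=os.path.getmtime, reverse=True)


def save_full_page(page_url, jar=JAR):
    """Run the jar on `page_url`; return the newest FP*.html found, or None."""
    subprocess.run(["java", "-jar", jar, page_url], check=True)
    # Look in the temp directory first, then fall back to the home directory.
    for directory in (tempfile.gettempdir(), os.path.expanduser("~")):
        found = find_result_files(directory)
        if found:
            return found[0]
    return None
```

    Picking the newest match is a guess at which file belongs to the current run; if several runs overlap, compare timestamps yourself.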

  • 2020-12-14 18:34

    This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to the same relative paths they have on the server.

    Most of the javascript won't work due to cross origin request blocking. But the content will look (mostly) the same.

    This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import chromedriver_binary  # adds the bundled chromedriver to PATH
    from lxml import html
    import requests
    import os
    
    driver = webdriver.Chrome()
    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
    base = 'https://blast.ncbi.nlm.nih.gov/'
    
    driver.get(URL)
    # find_element_by_id was removed in Selenium 4.3; use find_element(By.ID, ...)
    seq_query_field = driver.find_element(By.ID, "seq")
    seq_query_field.send_keys(SEQUENCE)
    blast_button = driver.find_element(By.ID, "b1")
    blast_button.click()
    
    content = driver.page_source
    # write the page content (exist_ok avoids an error if 'page' already exists)
    os.makedirs('page', exist_ok=True)
    with open('page/page.html', 'w', encoding='utf-8') as fp:
        fp.write(content)
    
    # download the referenced files to the same path as in the html
    sess = requests.Session()
    sess.get(base)            # sets cookies
    
    # parse html
    h = html.fromstring(content)
    # get css/js files loaded in the head. skip anything loaded from outside sources
    for hr in h.xpath('head//@href'):
        if not hr or hr.startswith(('http', '//')):
            continue
        # strip a leading slash so we don't build 'page//...' or a double-slash url
        local_path = 'page/' + hr.lstrip('/')
        hr = base + hr.lstrip('/')
        res = sess.get(hr)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, 'wb') as fp:
            fp.write(res.content)
    
    # get image/js files from the body. skip anything loaded from outside sources
    for src in h.xpath('//@src'):
        if not src or src.startswith(('http', '//')):
            continue
        local_path = 'page/' + src.lstrip('/')
        print(local_path)
        src = base + src.lstrip('/')
        res = sess.get(src)  # fetch src itself, not the last href from the head loop
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, 'wb') as fp:
            fp.write(res.content)
    

    You should have a folder called page with a file called page.html in it with the content you are after.
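
    If you want something a bit more robust than string concatenation for mapping a reference to a download URL and a local path, the following is one possible sketch using urllib.parse. The helper name resolve_asset and the out_dir default are my own choices, and it deliberately ignores query strings when building the local file name.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_asset(base_url, ref, out_dir="page"):
    """Map an href/src value to (absolute_url, local_path).

    Returns None for references that point at another host or that
    have no file path (e.g. the site root).
    """
    absolute = urljoin(base_url, ref)
    parsed = urlparse(absolute)
    if parsed.netloc != urlparse(base_url).netloc:
        return None  # external resource: skip, as the loops above do
    path = parsed.path.lstrip("/")
    if not path or path.endswith("/"):
        return None  # nothing that can be saved as a file
    return absolute, posixpath.join(out_dir, path)


base = "https://blast.ncbi.nlm.nih.gov/"
print(resolve_asset(base, "css/main.css"))
print(resolve_asset(base, "https://cdn.example.com/lib.js"))
```

    The css/main.css and cdn.example.com references are made-up inputs, not paths taken from the BLAST page.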

  • 2020-12-14 18:44

    As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so you could instead use an external automation library such as pyautogui.

    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')
    

    This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.
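
    If the sequence is too long or otherwise awkward as a file name, one alternative (my suggestion, not something the Save as... dialog requires) is to name the file after a short hash of the sequence. The helper name short_filename is hypothetical.

```python
import hashlib


def short_filename(sequence, length=16):
    """Return a short, stable file name derived from the sequence."""
    digest = hashlib.sha256(sequence.encode("utf-8")).hexdigest()
    return digest[:length] + ".html"


# The same sequence always maps to the same name.
print(short_filename("CCTAAACTATAGAAGGACAGCTC"))
```

    You would then pass short_filename(SEQUENCE) to pyautogui.typewrite instead of SEQUENCE + '.html'.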

    Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.
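
    As a sketch of adapting the key combination per OS: I am assuming the browser uses Cmd+S on macOS and Ctrl+S elsewhere, which I have not tested on every platform. The function names are made up, and pyautogui is imported inside the function because it needs a desktop session.

```python
import sys


def save_shortcut(platform=sys.platform):
    """Return the key names for the browser's 'Save as...' shortcut."""
    # Assumption: macOS uses the Command key, other platforms use Ctrl.
    modifier = "command" if platform == "darwin" else "ctrl"
    return (modifier, "s")


def open_save_dialog():
    """Send the platform's save shortcut to the focused window."""
    import pyautogui  # lazy import: fails without a display
    pyautogui.hotkey(*save_shortcut())


print(save_shortcut())
```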


    Full code, in which I also added conditional waits to improve speed:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.expected_conditions import visibility_of_element_located
    from selenium.webdriver.support.ui import WebDriverWait
    import pyautogui
    
    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
    
    # open page with selenium
    # (first need to download Chrome webdriver, or a firefox webdriver, etc)
    driver = webdriver.Chrome()
    driver.get(URL)
    
    # enter sequence into the query field and hit 'blast' button to search
    seq_query_field = driver.find_element(By.ID, "seq")
    seq_query_field.send_keys(SEQUENCE)
    
    blast_button = driver.find_element(By.ID, "b1")
    blast_button.click()
    
    # wait until results are loaded
    WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))
    
    # open 'Save as...' to save html and assets
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')
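
    To illustrate what the conditional wait in the code above does, here is a simplified stand-in written in plain Python. It is not Selenium's implementation: wait_until is a hypothetical helper that polls a condition until it returns something truthy or the timeout expires, which is the general idea behind WebDriverWait(...).until(...).

```python
import time


def wait_until(condition, timeout=60.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Example: a condition that becomes true about 0.2 seconds from now.
start = time.monotonic()
wait_until(lambda: time.monotonic() - start > 0.2, timeout=5, poll=0.05)
print("ready")
```

    The script continues as soon as the condition holds instead of always sleeping for a fixed worst-case duration.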
    