How to automate saving webpages?

Submitted by 放荡痞女 on 2019-11-27 09:53:00

I've accepted Tim Vanderzeil's answer because he directed me to the tool that I needed for this. Now I want to share what I've done with what he gave me. The solution is only semi-automated because of a problem with Kantu, but it's far and away better than trying to do it all manually. I'm posting this here both to share what I've learned and to see if anyone can offer improvements, including a solution to the problem that is preventing full automation.

First, some background on the technology, which is interesting. Kantu, and especially its extension XModules (which is what I needed for this project), are pretty new. The company that makes them was founded in 2016, and Kantu was announced in September 2017. But their history goes much deeper than that, since the company's founders include Mathias Roth, the original developer of iMacros. Kantu is a different implementation of another tool I mentioned in my question, Selenium. So there's a lot of cross-pollination in this esoteric field of browser automation.

Many people have been asking on StackOverflow for a long time how to automate saving of webpages, such as 1, 2, 3, 4, 5, and 6. None of the answers appear to me to be all that helpful. It's a bit strange: all browsers have the capability, so there must be modules for it floating around somewhere, and I don't know why I can't just call a function for it in PHP. The question linked as #5 above says it appears in browsers through "Webkit", but knowing that hasn't led me anywhere useful yet.

So, in the meantime, until I find that PHP function, I have to do it by turning my Web browser into a robot. I developed the code below for a few e-books that sit behind a paywall, for which I have a legitimate account, that I want to preserve for offline use, and that are not offered as PDFs. I determined two ways I could download the pages with Kantu:

  • I massaged the HTML of the table-of-contents pages to extract the needed URLs and put them into CSV files. These can be read by Kantu's command csvRead. The URL is passed to command open to open the page, then command XType sends Ctrl-S (or Alt-F-A) to tell the browser to save the page. XType is used again to enter the filename to save as (the part of the URL after the last "/"), and a final XType sends Enter to conclude the browser's Save-As dialog. Loop this, and the book is saved. The looping can be done either inside the macro using a label and command gotoLabel, or the macro can be written to do one page and the looping can be done in Kantu's GUI.

  • Alternatively, I can use the links on each page to go to the next page. This is the process I described in my question. I first used Kantu's recording process to get the identification of the next-page link, and used that as data in the code for the macro below (specifically as the "target" of commands XClick and click). I start up Kantu on the first webpage and the macro uses command XClick to right-click the next-page link, then XType to send "A" to the browser, telling it to copy the linked URL to the clipboard. Then the command click clicks the link to open the page, and the rest is the same as the previous method. Here, I'm using the next-page links to get the URLs instead of a CSV file.
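
As a rough illustration of the CSV preparation step in the first method, here is a minimal Python sketch. It is not the procedure actually used above, which is not described in detail. It assumes the table of contents is plain HTML whose links are ordinary `<a href>` tags; the sample HTML, the `example.com` base URL, the two-column `url,filename` layout, and the helper names `LinkCollector`, `filename_from_url`, and `toc_to_csv` are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(urljoin(self.base_url, href))


def filename_from_url(url):
    """Return the part of the URL path after the last '/'."""
    return urlsplit(url).path.rsplit("/", 1)[-1]


def toc_to_csv(toc_html, base_url):
    """Return CSV text with one 'url,filename' row per link in the TOC."""
    collector = LinkCollector(base_url)
    collector.feed(toc_html)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for url in collector.urls:
        writer.writerow([url, filename_from_url(url)])
    return out.getvalue()


# Hypothetical table of contents, for illustration only.
sample_toc = (
    '<ul><li><a href="ch01.html">One</a></li>'
    '<li><a href="/book/ch02.html">Two</a></li></ul>'
)
csv_text = toc_to_csv(sample_toc, "https://example.com/book/index.html")
print(csv_text)
```

Writing the filename into a second column is only one option. The macro could equally derive it from the URL with keystrokes in the Save-As dialog, as the second method does.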

Now, I mentioned that there is a problem in Kantu that prevents this from being fully automated. The last step of the process, sending Enter to the browser to conclude the Save-As dialog, is flaky for unknown reasons. Sometimes it works, and sometimes the dialog box just sits there, requiring me to press Enter myself to allow the process to move on to the next webpage. This is tedious and means that I need to participate in the process instead of leaving it running on its own. So, not perfect, but a whole lot better than having to do all the rest of the procedure manually as well, which would be out of the question for several hundred pages.

The free version of XModules has a limit of 25 commands per run. To pass that limit there is a one-time charge of $50. That would probably be well worth it if I could let the process run on its own. But since I have to babysit it anyway, I'm currently running the macro by clicking on Kantu's Play macro button for each page as well as watching for when I need to press Enter.

I've posted about the Enter problem and some other issues on Kantu's forum. Their team has been very responsive and helpful. I hope that I or they or someone reading this can figure out a solution. In the meantime, the semi-automated process is better than nothing.

Between the two methods described above, it's only the second one, using the next-page links to get the URLs, that can run without a loop, i.e., with a manual press of Play macro for each page. So that's the one I've been using for now. The code has a rather inelegant repetition of 25 Ctrl-Lefts as a workaround for the surprising absence of the Home key in XType's vocabulary, as well as the absence (as far as I've found) of a command for repeating a key-press.
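
Since the repetition has to be spelled out in the macro itself, one way to avoid typing it by hand is to generate those commands with a small script and paste the result into the JSON. The Python sketch below does that. The function name `repeated_key_commands` is hypothetical, and the grouping of five presses per XType command simply mirrors the macro; whether XType accepts longer sequences in one command is not something verified here.

```python
import json


def repeated_key_commands(key, presses, per_command=5):
    """Build Kantu-style XType command dicts that press `key` `presses` times.

    Presses are grouped into chunks of at most `per_command` per command.
    """
    commands = []
    remaining = presses
    while remaining > 0:
        count = min(per_command, remaining)
        commands.append({"Command": "XType", "Target": key * count, "Value": ""})
        remaining -= count
    return commands


ctrl_left = "${KEY_CTRL+KEY_LEFT}"
home_workaround = repeated_key_commands(ctrl_left, 25)

# Print the commands as JSON, ready to paste into the macro's "Commands" list.
print(json.dumps(home_workaround, indent=1))
```

The same helper covers the selection step: `repeated_key_commands("${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}", 11, per_command=4)` yields three commands with 4, 4, and 3 presses, which is the grouping used in the macro.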

Here is the Kantu code, in JSON:

{"Name": "SavePageAsComplete",
 "CreationDate": "2019-01-03",
 "Commands":
  [{"Command": "comment",
    "Target":  "Macro for Kantu with XModules. Based on demo macros DemoXClick and DemoXType and docs https://a9t9.com/kantu/docs/xclick and https://a9t9.com/kantu/docs/xtype. The target in the XClick and click commands is what was obtained from attempting to record this macro on the website, which resulted in only an open command and two identical click commands with that target.",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Set play speed to 0.3 seconds. (See Kantu manual section 'Setting the right macro replay speed'.)",
    "Value":   ""
    },
   {"Command": "store",
    "Target":  "medium",
    "Value":   "!replayspeed"
    },
   {"Command": "bringBrowserToForeground",
    "Target":  "",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Right-click the link for the next page and copy its URL to the clipboard.",
    "Value":   ""
    },
   {"Command": "XClick",
    "Target":  "//*[@id=\"container\"]/div[2]/section/div[2]/a/div",
    "Value":   "#right"
    },
   {"Command": "XType",
    "Target":  "A",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Click the link for the next page. (Tried with 'clickAndWait' instead in order to wait for the page to load, but that yielded error 'No page load event detected after 10 seconds.')",
    "Value":   ""
    },
   {"Command": "click",
    "Target":  "//*[@id=\"container\"]/div[2]/section/div[2]/a/div",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Open the Save-as dialog.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_S}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Wait for the dialog to appear.",
    "Value":   ""
    },
   {"Command": "pause",
    "Target":  "2000",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Paste the clipboard (URL of now-current page) into Filename text box.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_V}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Move the cursor to the beginning of the URL. (There is no Home key!)",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}${KEY_CTRL+KEY_LEFT}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Select from the beginning of the URL to the end of its path part.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}${KEY_SHIFT+KEY_CTRL+KEY_RIGHT}",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Delete the selection, leaving just the filename.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_DEL}",
    "Value":   ""
    },
   {"Command": "pause",
    "Target":  "500",
    "Value":   ""
    },
   {"Command": "comment",
    "Target":  "Save the page.",
    "Value":   ""
    },
   {"Command": "XType",
    "Target":  "${KEY_ENTER}",
    "Value":   ""
    }
   ]
 }

Maybe this will be of some help to other people who've been wanting to automate saving of pages. And if anyone can improve on this, maybe you could say how in a comment or another answer. Especially if you know why the Save-As dialog box doesn't close reliably, and know how to fix that.
