R: download pdf embedded in a webpage

Question

I'm trying to find an easier way to grab the table from the PDF embedded in this page or, even better, to download the PDF to a local drive:

My code is below and results are messy...

PS: none of the buttons at the bottom of the webpage work unless you use IE, which here means IE with RSelenium. I have written code that loads the page in IE and can successfully click any of the buttons, either to load the Excel file (where I get stuck at the pop-up window asking whether to open or save) or to open the PDF in the current window. Either way it's the same problem: I don't know how to grab the PDF. So dead ends everywhere.

Thanks in advance.

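library(RSelenium)
library(XML)       # provides htmlParse() and readHTMLTable()

# checkForServer()/startServer() belong to older RSelenium releases;
# later releases replace them with rsDriver()
checkForServer()
startServer()
remDr <- remoteDriver$new()
url <- "http://www.dmo.gov.uk/ceLogon.aspx?page=about&rptCode=D10A"
remDr$open(silent = TRUE)  # opens a browser
remDr$navigate(url)

# parse the rendered page source and pull out every HTML table
doc <- htmlParse(remDr$getPageSource()[[1]])
table <- readHTMLTable(doc, header = NA, stringsAsFactors = TRUE)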

Answer 1:

IE is not actually needed to interact with that page. Using Firefox or Chrome on a Mac, the small printer icon above and to the left of the data columns shows "Print this report" on mouse-over and, when clicked, downloads a file named "CrystalReportViewer1.pdf". If you then open that file in Tabula, a cross-platform application that runs locally and is used through the browser, you can extract the data in CSV form. The top of the extracted data (on April 1, 2016) looks like:

Syndication ,Gilt Name ,Amount Sold ,Issue ,Issue ,Announcement ,Results 
Date ,"",(£ million ,Price (£) ,Yield ,Press Release ,Press Release
"","",nominal),"","","",""
23 Feb 2016 ,0 1/8% Index-linked Treasury Gilt 2065 ," 2,750.0 ", 163.73 ,-0.8905% ,Announcement ,Results
01 Dec 2015 ,0 1/8% Index-linked Treasury Gilt 2046 ," 3,250.0 ", 129.74 ,-0.7475% ,Announcement ,Results
20 Oct 2015 ,2½% Treasury Gilt 2065 ," 4,750.0 ", 98.40 , 2.5570% ,Announcement ,Results
22 Sep 2015 ,0 1/8% Index-linked Treasury Gilt 2068 ," 2,500.0 ", 166.00 ,-0.8655% ,Announcement ,Results
21 Jul 2015 ,3½% Treasury Gilt 2068 ," 4,000.0 ", 121.31 , 2.7360% ,Announcement ,Results
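
Once the CSV is saved, the messy cells can be cleaned up in R. The following is only a minimal sketch, assuming the file keeps the seven-column layout shown above with three header rows; the function name `read_tabula_gilts` and the column names are made up for illustration and are not part of Tabula or of the original page.

```r
# Hypothetical column names, chosen here to match the seven columns above
col_names <- c("syndication_date", "gilt_name", "amount_sold_gbp_m",
               "issue_price", "issue_yield_pct", "announcement", "results")

# Read a Tabula CSV export and convert the text cells to usable types.
# skip = 3 assumes the three header rows seen in the excerpt above.
read_tabula_gilts <- function(path, skip = 3) {
  raw <- read.csv(path, header = FALSE, skip = skip, col.names = col_names,
                  colClasses = "character", strip.white = TRUE)
  raw[] <- lapply(raw, trimws)  # quoted cells keep their padding otherwise
  # "%d %b %Y" relies on English month abbreviations in the current locale
  raw$syndication_date  <- as.Date(raw$syndication_date, format = "%d %b %Y")
  raw$amount_sold_gbp_m <- as.numeric(gsub(",", "", raw$amount_sold_gbp_m))
  raw$issue_price       <- as.numeric(raw$issue_price)
  raw$issue_yield_pct   <- as.numeric(sub("%", "", raw$issue_yield_pct, fixed = TRUE))
  raw
}

# Try it on two data rows copied from the excerpt (no header rows, so skip = 0)
sample_rows <- c(
  '23 Feb 2016 ,0 1/8% Index-linked Treasury Gilt 2065 ," 2,750.0 ", 163.73 ,-0.8905% ,Announcement ,Results',
  '01 Dec 2015 ,0 1/8% Index-linked Treasury Gilt 2046 ," 3,250.0 ", 129.74 ,-0.7475% ,Announcement ,Results'
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_rows, tmp)
gilts <- read_tabula_gilts(tmp, skip = 0)
str(gilts)
```

Reading everything as character first and converting column by column keeps the thousands separators and percent signs from silently turning numeric columns into text or factors.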

Rather than trying to extract the PDF from that page (which is not in PDF form, as far as I can determine), you should use RSelenium to download the PDF file to a local drive and process it from there.
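
If Firefox stops at an open-or-save dialog, a custom profile can tell it to save PDFs straight to disk. This is a rough sketch modelled on the selDownloadZip.R demo, not something tested against this page: the helper names `pdf_download_prefs` and `open_download_session` are hypothetical, and it assumes the server sends the report with the MIME type application/pdf.

```r
# Hypothetical helper: Firefox preferences that save PDFs without prompting
pdf_download_prefs <- function(download_dir) {
  list(
    browser.download.dir = download_dir,   # where downloaded files should land
    browser.download.folderList = 2L,      # 2 = use the custom directory above
    # assumption: the report is served as application/pdf
    browser.helperApps.neverAsk.saveToDisk = "application/pdf",
    pdfjs.disabled = TRUE                  # do not open PDFs in the built-in viewer
  )
}

# Hypothetical helper: open a Firefox session that uses those preferences.
# Needs RSelenium installed and a Selenium server already running.
open_download_session <- function(download_dir) {
  fprof <- RSelenium::makeFirefoxProfile(pdf_download_prefs(download_dir))
  remDr <- RSelenium::remoteDriver(extraCapabilities = fprof)
  remDr$open(silent = TRUE)
  remDr
}

prefs <- pdf_download_prefs(tempdir())
str(prefs)
```

Calling `remDr <- open_download_session("some/local/folder")` before navigating to the page would then replace the plain `remoteDriver$new()` step.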

This is the button ID:

{'id':'CrystalReportViewer1_toptoolbar_print'}

There are demos in the RSelenium help pages. One is entitled: selDownloadZip.R. It shows how to execute a "click" on a "page Element":

webElem <- remDr$findElement("id", "CrystalReportViewer1_toptoolbar_print")
webElem$clickElement()

Then, looking at the element Inspector in Firefox, I see the id of the next button ("id", "theBttnbobjid_1459536946505_dialog_submitBtn"), so a further click is needed. However, the number in that id changes with each page access, so locate the button by its link text instead:

webElem <- remDr$findElement("link text", "Export")
webElem$clickElement()
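
The "link text" strategy only matches anchor (`<a>`) elements, so it may fail if the Export button is rendered as something else. One possible fallback, sketched here under the assumption that the "_dialog_submitBtn" suffix of the id stays the same between visits, is an XPath that ignores the changing number. The function name `click_export_button` is made up for this example.

```r
# Hypothetical helper: click the element whose id contains a stable suffix.
# remDr is an open RSelenium remoteDriver session.
click_export_button <- function(remDr, id_suffix = "_dialog_submitBtn") {
  # contains() skips over the number that changes on every page load
  xpath <- sprintf("//*[contains(@id, '%s')]", id_suffix)
  webElem <- remDr$findElement(using = "xpath", value = xpath)
  webElem$clickElement()
  invisible(xpath)  # hand the XPath back in case it is useful for debugging
}
```

With a live session this would be called as `click_export_button(remDr)` right after clicking the print icon.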

It would be a good idea to review the webElement-class help page.

Source: https://stackoverflow.com/questions/36359355/r-download-pdf-embedded-in-a-webpage

Tags

pdf

rselenium