Question
I'm trying to automatically download documents for Oil & Gas wells from the Colorado Oil and Gas Conservation Commission (COGCC) using the "rvest" and "downloader" packages in R.
The link to the table/form that contains the documents for a particular well is: http://ogccweblink.state.co.us/results.aspx?id=12337064
The "id=12337064" is the unique identifier for the well.
The documents on the form page can be downloaded by clicking them. An example is below. http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=3172781
The "DocumentID=3172781" is the unique document ID for the document to be downloaded. In this case, an xlsm file. Other file formats on the document page include PDF and xls.
So far I've been able to write code to download any document for any well, but it only works for the first page. The majority of wells have documents spread across multiple pages, and I'm unable to download documents on pages other than page 1 (all document pages share the same URL).
## Extract the document ID of the document to be downloaded, in this case "DIRECTIONAL DATA". Used the SelectorGadget tool to extract the CSS path
library(rvest)
html <- html("http://ogccweblink.state.co.us/results.aspx?id=12337064")
File <- html_nodes(html, "tr:nth-child(24) td:nth-child(4) a")
File <- as(File[[1]],'character')
DocId<-gsub('[^0-9]','',File)
DocId
[1] "3172781"
## To download the document, I use the downloader package
library(downloader)
linkDocId <- paste('http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=', DocId, sep='')
download(linkDocId,"DIRECTIONAL DATA" ,mode='wb')
trying URL 'http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=3172781'
Content type 'application/octet-stream' length 33800 bytes (33 KB)
downloaded 33 KB
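For reference, here is a rough variant that would grab every document ID on page 1 at once instead of one fixed table cell (the "td a" selector and the filtering pattern are my own untested guesses about the page structure, not something I've verified):

## Reuse the parsed page-1 HTML from above and pull every download link in the table
links <- html_nodes(html, "td a")
hrefs <- html_attr(links, "href")
## Keep only the DownloadDocument.aspx links and strip everything but the numeric DocumentId
docIds <- gsub("[^0-9]", "", hrefs[grepl("DownloadDocument.aspx", hrefs, fixed = TRUE)])
docIds

Each of those IDs can then be pasted onto the DownloadDocument.aspx URL exactly as in the snippet above.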
Does anyone know how I can modify my code to download documents on other pages?
Many thanks!
Em
Answer 1:
You have to use the very same cookie for the second query and pass the viewstate and validation fields as well. Quick example:
Load RCurl, then load the URL and preserve the cookie:

url <- 'http://ogccweblink.state.co.us/results.aspx?id=12337064'
library(RCurl)
curl <- curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = getCurlHandle())
page1 <- getURL(url, curl = curl)
Extract the VIEWSTATE and EVENTVALIDATION values after parsing the HTML (htmlTreeParse and xpathSApply come from the XML package):

library(XML)
page1 <- htmlTreeParse(page1, useInternal = TRUE)
viewstate <- xpathSApply(page1, '//input[@name = "__VIEWSTATE"]', xmlGetAttr, 'value')
validation <- xpathSApply(page1, '//input[@name = "__EVENTVALIDATION"]', xmlGetAttr, 'value')
Query the same URL again with the saved cookie and the extracted hidden INPUT values, and ask for the second page:

page2 <- postForm(url, curl = curl,
    .params = list(
        '__EVENTARGUMENT' = 'Page$2',
        '__EVENTTARGET' = 'WQResultGridView',
        '__VIEWSTATE' = viewstate,
        '__EVENTVALIDATION' = validation))
Extract the URLs from the table shown on the second page:
page2 <- htmlTreeParse(page2, useInternal = TRUE)
xpathSApply(page2, '//td/font/a', xmlGetAttr, 'href')
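Putting this together with the download step from the question, something along these lines should work (the output file names are arbitrary placeholders; the real names and extensions would have to come from the response headers):

## Collect the document links from page 2 and download each one,
## reusing the same curl handle so the cookie is sent along
hrefs <- xpathSApply(page2, '//td/font/a', xmlGetAttr, 'href')
docIds <- gsub('[^0-9]', '', hrefs)
for (id in docIds) {
    docUrl <- paste0('http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=', id)
    bin <- getBinaryURL(docUrl, curl = curl)
    writeBin(bin, paste0('document_', id, '.bin'))
}

For pages beyond the second, the same postForm call can presumably be repeated with 'Page$3', 'Page$4' and so on, re-extracting the hidden fields from each response if needed.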
Source: https://stackoverflow.com/questions/32132344/download-documents-from-aspx-web-page-in-r