There are good answers on SO about how to use readHTMLTable from the XML package, and I have done that with regular http pages. However, I am not able to solve my problem with https.
The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.
Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.
library("httr")
library("XML")
# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Read page
page <- GET(
  "https://ned.nih.gov/",
  path = "search/ViewDetails.aspx",
  query = "NIHID=0010121048",
  config(cainfo = cafile)
)
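# Use regex to extract the desired table
# Get the response body as text (older httr versions used text_content(page))
x <- content(page, as = "text")
# Assumes the desired table is the last <table> element on the page
tab <- sub('.*(<table.*</table>).*', '\\1', x)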
# Parse the table
readHTMLTable(tab)
The results:
$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                 francis.collins@nih.gov
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:                                       Â
6           Phone:                            301-496-2433
7             Fax:                                       Â
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:                                       Â
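As a possible alternative to the regex step, the table can be picked out by its id once the page has been parsed with the XML package. This is only a sketch: it assumes the id shown in the output above (ctl00_ContentPlaceHolder_dvPerson) stays the same, and the html string is a made-up stand-in for the real page text so that the example runs without a network connection.

```r
library("XML")

# Hypothetical stand-in for the page text returned by the request
html <- paste0(
  '<html><body>',
  '<table id="layout"><tr><td>navigation</td></tr></table>',
  '<table id="ctl00_ContentPlaceHolder_dvPerson">',
  '<tr><td>Legal Name:</td><td>Dr Francis S Collins</td></tr>',
  '<tr><td>Phone:</td><td>301-496-2433</td></tr>',
  '</table>',
  '</body></html>'
)

# Parse the whole document once
doc <- htmlParse(html, asText = TRUE)

# Select the table node by its id with XPath
nodes <- getNodeSet(doc, '//table[@id="ctl00_ContentPlaceHolder_dvPerson"]')
node <- nodes[[1]]

# Convert that single node to a data frame, treating every row as data
person <- readHTMLTable(node, header = FALSE, stringsAsFactors = FALSE)
print(person)
```

With the real page, the response text would be passed to htmlParse() in place of html. Selecting by id avoids depending on where the table sits in the markup.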
Get httr here: http://cran.r-project.org/web/packages/httr/index.html
EDIT: Useful page with FAQ about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html
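The cainfo option is what points curl at a certificate bundle for the https request. If system.file() cannot find the file it returns an empty string, in which case cainfo would not point at a real file. A small sketch of a guard, assuming the bundle lives at CurlSSL/cacert.pem inside RCurl as in the code above (the have_bundle name is just for illustration):

```r
# Locate the CA bundle that RCurl is assumed to ship with
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# system.file() returns "" when the package or the file cannot be found
have_bundle <- nzchar(cafile) && file.exists(cafile)

if (have_bundle) {
  message("Using CA bundle: ", cafile)
} else {
  message("No CA bundle found in RCurl; certificate verification may fail")
}
```

Running this check before the GET() call makes a missing bundle easier to tell apart from other SSL errors.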