How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?

迷失自我 2020-12-01 08:04

There are good answers on SO about how to use readHTMLTable from the XML package, and I have done that successfully with regular http pages. However, I am not able to solve my problem with https pages.

3 answers

误落风尘 (original poster)

2020-12-01 08:58

The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.

library("httr")
library("XML")

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  config(cainfo = cafile)
)

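# Extract the response body as text (content() replaces the deprecated text_content())
x <- content(page, as = "text")

# Use regex to extract the desired table
# (the <table ... </table> pattern is an assumption; adjust it to the page's markup)
tab <- sub('.*(<table.*</table>).*', '\\1', x)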

# Parse the table
readHTMLTable(tab)

The results:

$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                 francis.collins@nih.gov
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:
6           Phone:                            301-496-2433
7             Fax:
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:
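
Because the result is a two-column label/value table, it can be handy to turn it into a named vector for lookups. This is only a minimal sketch: the data frame is rebuilt by hand from three rows of the output above, and the names `person` and `fields` are made up for illustration.

```r
# Three rows copied from the output above, rebuilt by hand as a data frame
person <- data.frame(
  V1 = c("Legal Name:", "Phone:", "IC:"),
  V2 = c("Dr Francis S Collins", "301-496-2433", "OD (Office of the Director)"),
  stringsAsFactors = FALSE
)

# Drop the trailing colon from each label and use it as the element name
fields <- setNames(person$V2, sub(":$", "", person$V1))

# Look a value up by its label
fields[["Phone"]]
```

With the real result, the same two lines would be applied to the data frame stored under `$ctl00_ContentPlaceHolder_dvPerson`.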

Get httr here: http://cran.r-project.org/web/packages/httr/index.html

EDIT: Useful page with FAQ about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html
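
The regex step may not be needed at all. As a hedged sketch, the page text can be handed to htmlParse with asText = TRUE and the table picked out of the readHTMLTable result by its id. The HTML string here is a hypothetical stand-in for the fetched page (only the table id and the two rows are taken from the output above; the real markup may differ), so the example runs without network access.

```r
library("XML")

# Hypothetical stand-in for the text returned by the GET request;
# the real page is larger and its markup may differ.
html <- paste0(
  '<html><body>',
  '<table id="ctl00_ContentPlaceHolder_dvPerson">',
  '<tr><td>Legal Name:</td><td>Dr Francis S Collins</td></tr>',
  '<tr><td>Phone:</td><td>301-496-2433</td></tr>',
  '</table>',
  '</body></html>'
)

# Parse the string as HTML instead of treating it as a file name or URL
doc <- htmlParse(html, asText = TRUE)

# readHTMLTable returns a list of data frames, named by table id where present
tabs <- readHTMLTable(doc, stringsAsFactors = FALSE)
person <- tabs[["ctl00_ContentPlaceHolder_dvPerson"]]
print(person)
```

On a page with several tables, selecting by id this way avoids having to tune a regular expression to the markup.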
