How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?


迷失自我 2020-12-01 08:04

There are good answers on SO about how to use readHTMLTable from the XML package, and I have done that with regular http pages. However, I am not able to solve my problem with https pages.

3 answers

小蘑菇 (original poster)

2020-12-01 09:04
Using Andrie's great way to get past the https, a way to get at the data without readHTMLTable is also below.

A table in HTML may have an ID. In this case the table has a convenient one, and an XPath expression passed to the getNodeSet function selects it directly.
```
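library(httr)  # GET(), config(), content()
library(XML)   # htmlParse(), getNodeSet()

# Define certificate file (CA bundle shipped with RCurl)
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/",
  path = "search/ViewDetails.aspx",
  query = "NIHID=0010121048",
  # ssl.verifypeer = FALSE switches certificate verification off;
  # drop it to verify the server against cafile instead
  config(cainfo = cafile, ssl.verifypeer = FALSE)
)

# Parse the response body and select the table by its id
h <- htmlParse(content(page, as = "text"), asText = TRUE)
ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")
ns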
```
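Once the table node is selected, the cell text can be pulled out without readHTMLTable. The following is only a minimal sketch run against a small inline HTML string, so that it works offline. The markup, the "Legal Name" row, and the two-cell label/value layout are assumptions on my part, not the real page.

```r
library(XML)

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html <- '<html><body>
<table id="ctl00_ContentPlaceHolder_dvPerson">
  <tr><td>Legal Name:</td><td>Jane Doe</td></tr>
  <tr><td>Manager:</td><td><a href="ViewDetails.aspx?NIHID=0010080638">Colleen Barros</a></td></tr>
</table>
</body></html>'

h <- htmlParse(html, asText = TRUE)
ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")

# One list entry per table row, each a character vector of trimmed cell text
rows <- lapply(getNodeSet(ns[[1]], ".//tr"), function(tr) {
  xpathSApply(tr, "./td", function(td) trimws(xmlValue(td)))
})

# Assumes every row is a label cell followed by a value cell
labels <- sub(":$", "", sapply(rows, `[`, 1))
fields <- setNames(sapply(rows, `[`, 2), labels)
fields
```

The relative paths (`.//tr`, `./td`) keep each query inside the node it is applied to, which matters when the page contains more than one table.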
I still need to extract the IDs behind the hyperlinks.

For example, instead of Colleen Barros as manager, I need to get the ID 0010080638.

Manager:Colleen Barros
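
One possible way to get at those IDs is to read the href attribute of each link in the table with xmlGetAttr and pull the number out of it. This is a sketch under assumptions: the inline HTML is made up, and I am guessing that the link carries the ID as an NIHID query parameter, as the page URL above does.

```r
library(XML)

# Hypothetical fragment; the href format (an NIHID query parameter) is assumed.
html <- '<table id="ctl00_ContentPlaceHolder_dvPerson">
  <tr><td>Manager:</td>
      <td><a href="ViewDetails.aspx?NIHID=0010080638">Colleen Barros</a></td></tr>
</table>'

h <- htmlParse(html, asText = TRUE)
xp <- "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']//a[@href]"

# Visible link text and the target of each link, in document order
link_text <- xpathSApply(h, xp, xmlValue)
link_href <- xpathSApply(h, xp, xmlGetAttr, "href")

# Keep the digits that follow "NIHID="; NA when a link has no such parameter
ids <- ifelse(grepl("NIHID=[0-9]+", link_href),
              sub(".*NIHID=([0-9]+).*", "\\1", link_href),
              NA_character_)

result <- data.frame(name = link_text, id = ids, stringsAsFactors = FALSE)
result
```

Keeping the IDs as character strings preserves the leading zeros, which would be lost if they were converted to numbers.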