How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?

迷失自我 2020-12-01 08:04

There are good answers on SO about how to use readHTMLTable from the XML package, and I have done that with regular http pages. However, I am not able to solve my problem with https pages.

3 answers
  •  小蘑菇 (original poster)
    2020-12-01 09:04

    Using Andrie's great way to get past the https, here is also a way to get at the data without readHTMLTable.

    A table in HTML may have an id attribute. In this case the table has one, and an XPath expression passed to getNodeSet selects it directly.

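    library(httr)  # GET(), config(), content()
    library(XML)   # htmlParse(), getNodeSet()

    # Define certificate file (the CA bundle shipped with RCurl)
    cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
    # Read page
    # Note: ssl.verifypeer = FALSE turns off certificate verification,
    # so the CA bundle in cainfo is not actually enforced here.
    page <- GET(
      "https://ned.nih.gov/",
      path = "search/ViewDetails.aspx",
      query = "NIHID=0010121048",
      config(cainfo = cafile, ssl.verifypeer = FALSE)
    )

    # Parse the response body as HTML, then select the table by its id
    h <- htmlParse(content(page, as = "text"), asText = TRUE)
    ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")
    ns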
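
    If a data frame is still wanted, the selected node can probably be handed to readHTMLTable directly, since that function also accepts a table node. The sketch below runs offline on a small inline HTML string; the table contents (including the placeholder name "Example Person") are hypothetical stand-ins, and the real page's markup may differ.

    ```r
    library(XML)  # htmlParse(), getNodeSet(), readHTMLTable()

    # Hypothetical stand-in for the downloaded page; the real markup may differ.
    html <- '
    <table id="ctl00_ContentPlaceHolder_dvPerson">
      <tr><td>Legal Name:</td><td>Example Person</td></tr>
      <tr><td>Manager:</td><td>Colleen Barros</td></tr>
    </table>'

    h <- htmlParse(html, asText = TRUE)
    ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")

    # getNodeSet() returns a list of nodes; convert the first match to a data frame.
    # header = FALSE because this two-column layout has no header row.
    person <- readHTMLTable(ns[[1]], header = FALSE, stringsAsFactors = FALSE)
    person
    ```

    Each row of the result holds one label and its value, so a field can be looked up by matching on the first column.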
    

    I still need to extract the IDs behind the hyperlinks.

    For example, instead of Colleen Barros as manager, I need to get the ID 0010080638.

    Manager: Colleen Barros
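
    One possible way to reach those IDs is to read the href attribute of each link inside the table rather than its text. The sketch below works on an inline HTML string so it runs offline. It assumes the links look like ViewDetails.aspx?NIHID=..., which is a guess based on the page URL used above, not something confirmed from the real markup.

    ```r
    library(XML)  # htmlParse(), xpathSApply(), xmlGetAttr(), xmlValue()

    # Hypothetical stand-in for the parsed page; the href format is an assumption.
    html <- '
    <table id="ctl00_ContentPlaceHolder_dvPerson">
      <tr><td>Manager:</td>
          <td><a href="ViewDetails.aspx?NIHID=0010080638">Colleen Barros</a></td></tr>
    </table>'
    h <- htmlParse(html, asText = TRUE)

    # Select every link inside the table, then read its href and its visible text
    xp <- "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']//a[@href]"
    hrefs  <- xpathSApply(h, xp, xmlGetAttr, "href")
    labels <- xpathSApply(h, xp, xmlValue)

    # Keep the digits that follow NIHID= in each href; NA where the pattern is absent
    ids <- ifelse(grepl("NIHID=[0-9]+", hrefs),
                  sub(".*NIHID=([0-9]+).*", "\\1", hrefs),
                  NA_character_)

    links <- data.frame(name = labels, id = ids, stringsAsFactors = FALSE)
    links
    ```

    Keeping the ID as a character string preserves its leading zeros, which a numeric conversion would drop.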
