How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?

迷失自我 2020-12-01 08:04

There are good answers on SO about how to use readHTMLTable from the XML package, and I did that with regular http pages; however, I am not able to solve my problem with https pages.
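
For illustration, a minimal sketch of the kind of call that fails (the URL is just a placeholder for any https page containing a <table>; the XML package cannot fetch content over SSL on its own):

    library(XML)
    # Works for plain http pages, but errors out when given an https URL
    tables <- readHTMLTable("https://example.com/some-page")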

3 Answers
  •  天涯浪人
    2020-12-01 08:59

    This is the function I use to deal with this problem. It detects whether the URL uses https and, if so, fetches the page with httr before parsing the tables.

    readHTMLTable2 <- function(url, which = NULL, ...) {
      require(httr)
      require(XML)
      if (grepl("^https", url)) {
        # Fetch the page with httr (which handles SSL), then parse the response text
        page <- GET(url, user_agent("httr-soccer-ranking"))
        doc  <- htmlParse(content(page, as = "text"), asText = TRUE)
        if (is.null(which)) {
          # Read every table on the page
          tmp <- readHTMLTable(doc, ...)
        } else {
          # Pull out only the table at the requested position
          tableNodes <- getNodeSet(doc, "//table")
          tab <- tableNodes[[which]]
          tmp <- readHTMLTable(tab, ...)
        }
      } else {
        # Plain http URLs can be passed to readHTMLTable directly
        tmp <- readHTMLTable(url, which = which, ...)
      }
      return(tmp)
    }
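
    For example, the wrapper can be called like this (the URL is just a placeholder for any https page that contains a <table>; which selects one table by position):

    # Example usage (placeholder https URL)
    all_tables  <- readHTMLTable2("https://example.com/rankings")
    first_table <- readHTMLTable2("https://example.com/rankings", which = 1)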
    
