Comatose web crawler in R (w/ rvest)


Unfortunately, none of the above solutions worked for me. Some URLs freeze the R script entirely, whether they are fetched with read_html(..) from rvest, GET(..) from httr, or getURL(..) / getURLContent(..) from RCurl.

The only solution that worked for me is a combination of evalWithTimeout from R.utils and a tryCatch block:

# install.packages("R.utils")
# install.packages("rvest")
library(R.utils)
library(rvest)
pageIsBroken = FALSE

url = "http://www.detecon.com/de/bewerbungsformular?job-title=berater+f%c3%bcr+%e2%80%9cdigital+transformation%e2%80%9d+(m/w)"

page = tryCatch(

  evalWithTimeout({ read_html(url, encoding="UTF-8") }, timeout = 5),

  error = function(e) {
    pageIsBroken <<- TRUE; 
    return(e)
  }
)

if (pageIsBroken) {
  print(paste("Error Msg:", toString(page)))
}
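
Note that evalWithTimeout has since been deprecated in R.utils in favour of withTimeout. A minimal sketch of the same idea using the newer name (the url variable is assumed to be defined as above) could look like this:

library(R.utils)
library(rvest)

# withTimeout() is the non-deprecated replacement for evalWithTimeout();
# on timeout it signals an error condition, which tryCatch() turns into NULL here
page <- tryCatch(
  withTimeout(read_html(url, encoding = "UTF-8"), timeout = 5),
  error = function(e) NULL
)

if (is.null(page)) {
  message("Could not fetch: ", url)
}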

A simple workaround to your problem is to repeat the HTTP request until you receive a successful response from the server:

price <- character(n)  # pre-allocate the result vector

for (i in 1:n) {
  # keep retrying until read_html() succeeds for this URL
  repeat {
    html <- try(read_html(url_list[i]), silent = TRUE)
    if (!inherits(html, "try-error")) break
  }
  price[i] <- html %>% html_node(".price span") %>% html_text()
}
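
One caveat with this approach: if a URL is permanently broken, the repeat loop never terminates. A hedged variation that caps the number of attempts (max_tries and the Sys.sleep() back-off are my own additions, not part of the original answer) might look like:

library(rvest)

max_tries <- 5

for (i in 1:n) {
  html <- NULL
  for (attempt in 1:max_tries) {
    result <- try(read_html(url_list[i]), silent = TRUE)
    if (!inherits(result, "try-error")) {
      html <- result
      break
    }
    Sys.sleep(1)  # brief pause before retrying
  }
  if (is.null(html)) {
    price[i] <- NA  # give up on this URL after max_tries failures
    next
  }
  price[i] <- html %>% html_node(".price span") %>% html_text()
}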

I encountered the same problem (read_html stalling on some web pages). In my case, fetching the page with RCurl's getURL first helped. In combination with the previous answer you could try this:

library(RCurl)
library(rvest)

# fetch the raw HTML with RCurl first, retrying until the request succeeds
repeat {
  rawhtml <- try(getURL(link[i], .encoding = "ISO-8859-1", .mapUnicode = FALSE), silent = TRUE)
  if (!inherits(rawhtml, "try-error")) break
}

# then parse the downloaded string with rvest
html <- read_html(rawhtml)
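
Another option, if you mainly need a hard per-request timeout rather than a separate download step, is httr, which lets you pass a timeout directly to the request. A minimal sketch under that assumption (url is a placeholder for the page you want):

library(httr)
library(rvest)

# GET() aborts with an error once the timeout is exceeded,
# instead of blocking indefinitely the way read_html() can on a bad URL
resp <- tryCatch(GET(url, timeout(5)), error = function(e) NULL)

if (!is.null(resp) && status_code(resp) == 200) {
  html <- read_html(content(resp, as = "text", encoding = "UTF-8"))
}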