rvest: How to avoid the Error in open.connection(x, "rb") : HTTP error 404 in R

Submitted by 这一生的挚爱 on 2019-12-08 03:26:19

Question


I'd like to extract some information from a list of websites. I have a list of URLs, but some of them don't work or don't exist.

The Error is:

Error in open.connection(x, "rb") : HTTP error 404 R

library(rvest)
url_web<-(c("https://it.wikipedia.org/wiki/Roma",
        "https://it.wikipedia.org/wiki/Milano",
        "https://it.wikipedia.org/wiki/Napoli",
        "https://it.wikipedia.org/wiki/Torinoooo", # for example this is an error
        "https://it.wikipedia.org/wiki/Palermo",
        "https://it.wikipedia.org/wiki/Venezia"))

I wrote this code for my task:

I tried to use try, but it doesn't work.

I also tried ifelse(url.exists(url_web) == TRUE, Cont <- read_html(url_web), NA) inside the for loop, but it doesn't work either (see the sketch after the loop below).

for (i in 1:length(url_web)) {
  Cont <- read_html(url_web[i])   # stops with "HTTP error 404" at the broken URL
  Dist_1 <- html_nodes(Cont, ".firstHeading") %>%
    html_text()
  print(Dist_1)
}
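
For reference, here is one way the url.exists idea could be written as a guard inside the loop. This is only a sketch, not code from the original question, and it assumes the RCurl package is installed; it simply skips any URL that does not respond:

library(rvest)
library(RCurl)

for (i in seq_along(url_web)) {
  if (!RCurl::url.exists(url_web[i])) next   # skip URLs that do not respond
  Cont <- read_html(url_web[i])
  Dist_1 <- html_nodes(Cont, ".firstHeading") %>%
    html_text()
  print(Dist_1)
}

Note that this issues an extra HTTP request per URL just to test it; the answer below avoids that by catching the error from read_html directly.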

The question is: how can I skip the URLs that don't respond or are written incorrectly?

Thank you in advance.

Francesco


Answer 1:


A simple try should do the trick

parsed_pages <- replicate(list(), n = length(url_web))
for (k in seq_along(url_web)) parsed_pages[[k]] <- try(xml2::read_html(url_web[k]), silent = TRUE)

The silent = TRUE argument means any error message will be suppressed. By default, silent = FALSE, which makes try report the errors. Note that even with silent = FALSE the code works (the reported errors just make it look as though it didn't).
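
For comparison, the same skip-on-error behaviour can also be written with tryCatch, which returns a value of your choosing (NULL here) for a failed URL instead of a try-error object. This is a variant of the answer, not part of it, and parsed_pages_v2 is just an illustrative name:

parsed_pages_v2 <- lapply(url_web, function(u) {
  tryCatch(xml2::read_html(u), error = function(e) NULL)   # NULL marks a URL that failed
})

The failed entries can then be dropped with Filter(Negate(is.null), parsed_pages_v2).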

Here we can test the above code

for (k in seq_along(url_web)) print(class(parsed_pages[[k]]))
# [1] "xml_document" "xml_node"    
# [1] "xml_document" "xml_node"    
# [1] "xml_document" "xml_node"    
# [1] "try-error"
# [1] "xml_document" "xml_node"    
# [1] "xml_document" "xml_node" 


Source: https://stackoverflow.com/questions/56303257/rvest-how-to-avoid-the-error-in-open-connectionx-rb-http-error-404-r
