Using tryCatch and rvest to deal with 404 and other crawling errors

Front-end · open · 2 answers

Asked by 醉梦人生, 2020-12-16 03:38

When retrieving the h1 title using rvest, I sometimes run into 404 pages. This stops the process and returns the following error.

Error in open.con
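For context, a minimal sketch of the unguarded version that reproduces the problem: as soon as one URL in the vector returns a 404, read_html() throws and the whole lapply() call aborts. Both URLs below are placeholders, purely for illustration.

    library(rvest)  # read_html(), html_nodes(), html_text(), and the %>% pipe

    # Placeholder URLs: the second one is assumed to be a dead link
    urls <- c(
        "http://example.com/a-real-page.html",
        "http://example.com/a-missing-page.html"
    )

    # No error handling: the first 404 stops the entire loop with the error above
    titles <- lapply(urls, function(url) {
        read_html(url) %>% html_nodes("h1") %>% html_text()
    })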

2 Answers
  •  暖寄归人, 2020-12-16 04:09

    You can see this question for a full explanation of the tryCatch pattern used below.

    library(rvest)  # read_html(), html_nodes(), html_text(), and the %>% pipe

    urls <- c(
        "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
        "http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
        "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html",
        "http://boingboing.net/2016/06/16/omar-mateen-posted-to-facffffdebook.html")
    
    
    readUrl <- function(url) {
        out <- tryCatch(
            {
                # The 'try' part: fetch the page and extract the <h1> text
                message("This is the 'try' part")
                url %>% as.character() %>% read_html() %>% html_nodes('h1') %>% html_text()
            },
            error = function(cond) {
                # On a 404 (or any other error) report the problem and return NA
                # instead of stopping the whole loop
                message(paste("URL does not seem to exist:", url))
                message("Here's the original error message:")
                message(cond)
                return(NA)
            }
        )
        return(out)
    }
    
    y <- lapply(urls, readUrl)
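
    Failed fetches come back as NA, so the resulting list is easy to split afterwards. A small sketch using the objects defined above (y and urls); the helper names here (failed, failed_urls, ok_titles) are just illustrative:

    # Flag the URLs for which readUrl() returned NA (i.e. the fetch failed)
    failed <- vapply(y, function(x) length(x) == 1 && is.na(x), logical(1))

    failed_urls <- urls[failed]    # e.g. the misspelled boingboing link above
    ok_titles   <- y[!failed]      # <h1> text for the pages that resolved

    If the purrr package is available, purrr::possibly() gives the same NA-on-failure behaviour with less boilerplate; a hedged alternative sketch:

    library(purrr)
    library(rvest)

    # possibly() wraps the scraping function so any error yields NA_character_
    safe_title <- possibly(
        function(url) read_html(url) %>% html_nodes("h1") %>% html_text(),
        otherwise = NA_character_
    )

    y <- map(urls, safe_title)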
    
