Question
I recently discovered the rvest package in R and decided to try out some web scraping.
I wrapped a small web crawler in a function so I could pipe its output into cleaning steps, etc.
With a small URL list (e.g. 1-100 URLs) the function works fine, but with a larger list it hangs at some point. It seems as if one of the commands is waiting for a response, never gets one, and never raises an error.
urlscrape <- function(url_list) {
  library(rvest)
  library(dplyr)
  assets <- NA
  price <- NA
  description <- NA
  city <- NA
  n <- length(url_list)
  pb <- txtProgressBar(min = 0, max = n, style = 3)
  for (i in 1:n) {
    # scraping for price
    try({read_html(url_list[i]) %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
    # scraping for city
    try({read_html(url_list[i]) %>% html_node(".city") %>% html_text() -> city[i]}, silent = TRUE)
    # scraping for description
    try({read_html(url_list[i]) %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ") -> description[i]}, silent = TRUE)
    # scraping for assets
    try({read_html(url_list[i]) %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ") -> assets[i]}, silent = TRUE)
    Sys.sleep(2)
    setTxtProgressBar(pb, i)
  }
  time <- Sys.time()
  print("")
  paste("Finished at", time) %>% print()
  print("")
  return(as.data.frame(cbind(price, city, description, assets)))
}
(1) Without knowing the exact problem, I looked for a timeout option in the rvest package, to no avail. I then tried the timeout option from the httr package (the console still hung). For ".price" the line would become:
try({content(GET(url_list[i], timeout=(10)), timeout=(10), as="text") %>% read_html() %>% html_node(".price span") %>% html_text() -> price[i]}, silent = TRUE)
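For reference, httr usually applies a timeout through its timeout() config helper rather than a plain timeout= argument; a minimal sketch of that variant for the ".price" line, inside the same loop, would be:

library(httr)
library(rvest)

# timeout(10) aborts the request after 10 seconds and signals an error,
# which try() swallows instead of letting the call hang
try({
  resp <- GET(url_list[i], timeout(10))
  content(resp, as = "text") %>% read_html() %>%
    html_node(".price span") %>% html_text() -> price[i]
}, silent = TRUE)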
I thought of other solutions and tried to implement them, but they did not work either.
(2) A time limit with setTimeLimit:
n <- length(url_list)
pb <- txtProgressBar(min = 0, max = n, style = 3)
setTimeLimit(elapsed = 20)
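For reference, a single setTimeLimit() call before the loop limits the whole remaining computation, and the limit only fires at points where R checks for user interrupts, so a blocked connection may never be cut off. A per-request sketch (assuming a 20-second limit per page is acceptable) would set and reset the limit around each call:

library(rvest)

for (i in 1:n) {
  # give this iteration's request at most ~20 seconds of wall-clock time
  setTimeLimit(elapsed = 20)
  page <- try(read_html(url_list[i]), silent = TRUE)
  setTimeLimit(elapsed = Inf)   # lift the limit again for the rest of the loop
  if (!inherits(page, "try-error")) {
    page %>% html_node(".price span") %>% html_text() -> price[i]
  }
}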
(3) Testing for URL success, with c increasing after the 4th scrape:
for (i in 1:n) {
  while (url_success(url_list[i]) == TRUE & c == i) {
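The same idea can be sketched without the (since deprecated) url_success() helper, by probing each URL with a short timeout and checking the status code before scraping (a sketch assuming a 5-second limit):

library(httr)
library(rvest)

for (i in 1:n) {
  # probe the URL with a short timeout; skip it if there is no usable response
  resp <- try(GET(url_list[i], timeout(5)), silent = TRUE)
  if (inherits(resp, "try-error") || status_code(resp) >= 400) next
  # reuse the response we already fetched instead of requesting the page again
  content(resp, as = "text") %>% read_html() %>%
    html_node(".price span") %>% html_text() -> price[i]
}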
None of these worked, and the function still hangs when the URL list is large. Question: why does the console hang, and how can it be solved? Thanks for reading.
Answer 1:
Unfortunately, none of the above solutions worked for me. Some URLs freeze the R script, no matter whether it is read_html(..) from rvest, GET(..) from httr, or getURL(..) / getURLContent(..) from RCurl.
The only solution that worked for me is a combination of evalWithTimeout from R.utils and a tryCatch block:
# install.packages("R.utils")
# install.packages("rvest")
library(R.utils)
library(rvest)
pageIsBroken = FALSE
url = "http://www.detecon.com/de/bewerbungsformular?job-title=berater+f%c3%bcr+%e2%80%9cdigital+transformation%e2%80%9d+(m/w)"
page = tryCatch(
evalWithTimeout({ read_html(url, encoding="UTF-8") }, timeout = 5),
error = function(e) {
pageIsBroken <<- TRUE;
return(e)
}
)
if (pageIsBroken) {
print(paste("Error Msg:", toString(page)))
}
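Folded back into the loop from the question, the same pattern might look like the sketch below (withTimeout() is the current name R.utils gives evalWithTimeout(); the page is fetched once per URL and then queried for each field):

library(R.utils)
library(rvest)
library(dplyr)

for (i in 1:n) {
  page <- tryCatch(
    withTimeout(read_html(url_list[i]), timeout = 5),
    error = function(e) NA   # timeouts and HTTP/parse errors both end up here
  )
  if (!inherits(page, "xml_document")) next   # skip URLs that failed
  page %>% html_node(".price span") %>% html_text() -> price[i]
  page %>% html_node(".city") %>% html_text() -> city[i]
  page %>% html_nodes("h1") %>% html_text() %>% paste(collapse = " ") -> description[i]
  page %>% html_nodes(".assets>li") %>% html_text() %>% paste(collapse = " ") -> assets[i]
}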
Answer 2:
A simple workaround is to repeat the HTTP request until you receive a successful response from the server:
for (i in 1:n) {
  repeat {
    html <- try(read_html(url_list[i]), silent = TRUE)
    # read_html() returns an object with more than one class, so test with
    # inherits() rather than comparing class() directly
    if (!inherits(html, "try-error")) break
  }
  html %>% html_node(".price span") %>% html_text() -> price[i]
}
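One caveat: if a URL is permanently broken, the repeat loop above never exits. A bounded variant (a sketch assuming at most 5 attempts per URL) could look like this:

library(rvest)

for (i in 1:n) {
  attempts <- 0
  repeat {
    html <- try(read_html(url_list[i]), silent = TRUE)
    attempts <- attempts + 1
    # stop after a successful read or after 5 failed tries
    if (!inherits(html, "try-error") || attempts >= 5) break
  }
  if (inherits(html, "try-error")) next   # give up on this URL
  html %>% html_node(".price span") %>% html_text() -> price[i]
}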
Answer 3:
I encountered the same problem (read_html stalling on some web pages). In my case, fetching the page with RCurl's getURL helped. In combination with the previous answer, you could try this:
library(RCurl)
library(rvest)

repeat {
  rawhtml <- try(getURL(link[i], .encoding = "ISO-8859-1", .mapUnicode = FALSE), silent = TRUE)
  if (!inherits(rawhtml, "try-error")) break
}
html <- read_html(rawhtml)
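If getURL itself ever stalls, libcurl timeouts can also be passed through its .opts argument; a sketch assuming a 10-second total limit and a 5-second connect limit:

library(RCurl)
library(rvest)

# timeout / connecttimeout are libcurl options, given in seconds
rawhtml <- try(getURL(link[i],
                      .encoding = "ISO-8859-1", .mapUnicode = FALSE,
                      .opts = list(timeout = 10, connecttimeout = 5)),
               silent = TRUE)
if (!inherits(rawhtml, "try-error")) html <- read_html(rawhtml)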
Source: https://stackoverflow.com/questions/32883512/comatose-web-crawler-in-r-w-rvest