rvest Error in open.connection(x, "rb") : Timeout was reached

[亡魂溺海] submitted on 2019-11-26 12:27:19

Question


I'm trying to scrape content from http://google.com, but I get the following error message.

library(rvest)
html("http://google.com")

Error in open.connection(x, "rb") : Timeout was reached
In addition: Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")

Since I'm on a company network, this may be caused by a firewall or proxy. I tried using set_config, but it did not work.


Answer 1:


I encountered the same Error in open.connection(x, "rb") : Timeout was reached issue when working behind a proxy on the office network.

Here's what worked for me:

library(rvest)
url <- "http://google.com"
# Download the page to a local file first, then parse the saved copy
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
content <- read_html("scrapedpage.html")

Credit : https://stackoverflow.com/a/38463559
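
A slightly more defensive variant of the same workaround, offered as a sketch: it downloads to a temporary file rather than the working directory and reports a failed download instead of stopping the script. The function name `read_html_via_download` and the choice to return NULL on failure are assumptions made for illustration, not part of the original answer.

```r
library(rvest)  # provides read_html()

# Hypothetical helper: download a page to a temporary file, then parse it.
read_html_via_download <- function(url) {
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp), add = TRUE)  # remove the temporary file on exit
  ok <- tryCatch(
    # download.file() returns 0 (invisibly) on success
    download.file(url, destfile = tmp, quiet = TRUE) == 0,
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      FALSE
    }
  )
  if (!ok) return(NULL)
  read_html(tmp)
}

# Usage (requires network access):
# page <- read_html_via_download("http://google.com")
```

Whether this gets past a proxy depends on how download.file is configured on the machine; it may pick up system proxy settings that a direct connection does not.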




Answer 2:


This is probably because your call to read_html (or html, in your case) does not identify itself to the server it is trying to retrieve content from, which is the default behaviour. Using the curl package, open the connection with a handle that sets a user agent, and pass that connection to read_html so that your scraper identifies itself.

library(rvest)
library(curl)
read_html(curl('http://google.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
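
Since the question suspects a proxy, the same handle can also carry a longer timeout and a proxy address. The sketch below assumes the libcurl option names `useragent`, `connecttimeout`, `timeout` and `proxy` as exposed by the curl package; the helper name `make_handle`, the 60-second default and the proxy address in the usage comment are placeholders, not values from this thread.

```r
library(curl)

# Hypothetical helper: build a curl handle that identifies the scraper,
# allows more time before giving up, and optionally routes through a proxy.
make_handle <- function(user_agent = "Mozilla/5.0", timeout_sec = 60, proxy = NULL) {
  h <- new_handle(
    useragent = user_agent,
    connecttimeout = timeout_sec,  # seconds allowed to establish the connection
    timeout = timeout_sec          # seconds allowed for the whole transfer
  )
  if (!is.null(proxy)) {
    handle_setopt(h, proxy = proxy)  # placeholder form: "http://host:port"
  }
  h
}

# Usage (requires network access; the proxy URL is a placeholder):
# library(rvest)
# h <- make_handle(proxy = "http://proxy.example.com:8080")
# page <- read_html(curl("http://google.com", handle = h))
```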



Answer 3:


I ran into this issue because my VPN was switched on. After turning it off, I retried and the error went away.




Answer 4:


I was facing a similar problem, and a small hack solved it. Two characters in the URL were causing the problem, so I replaced "è" with "e" and "é" with "e", and it worked. Just make sure the URL is still valid after the replacement.
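
The replacement described above can be done with base R's `chartr`. As an alternative that this answer does not mention, percent-encoding the URL with `utils::URLencode` keeps the original characters rather than changing them; whether either form resolves depends on the server. The helper name `strip_accents` and the example.com URL are made up for illustration.

```r
# Hypothetical URL containing the two accented characters from the answer.
url <- "http://example.com/caf\u00e9/cr\u00e8me"

# Approach from the answer: swap each accented character for a plain "e".
strip_accents <- function(u) chartr("\u00e8\u00e9", "ee", u)
plain_url <- strip_accents(url)

# Alternative (an assumption, not from the answer): percent-encode the
# non-ASCII characters so the URL contains only ASCII.
encoded_url <- utils::URLencode(enc2utf8(url))
```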



Source: https://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached
