Rcurl: url.exists returns false when url does exists

|▌冷眼眸甩不掉的悲伤 提交于 2019-11-27 06:48:59

问题


Trying to download information from a specific web page, and although it opens fine in any browser, RCurl says it does not exists:

url.exists("http://www.transfermarkt.es/liga-mx-apertura/startseite/wettbewerb/MEXA")
[1] FALSE

Same results when using ".de".

url.exists("http://www.transfermarkt.de/liga-mx-clausura/startseite/wettbewerb/MEX1")
[1] FALSE

It also returns an error when using other functions of RCurl

> htmlParse("http://www.transfermarkt.es/liga-mx-apertura/startseite/wettbewerb/MEXA")
Error: failed to load HTTP resource

> htmlTreeParse("http://www.transfermarkt.es/liga-mx-apertura/startseite/wettbewerb/MEXA")
Error: failed to load HTTP resource

> htmlParse(getURL("http://www.transfermarkt.es/liga-mx-apertura/startseite/wettbewerb/MEXA"))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr>
<center>nginx</center>
</body>
</html>

Why is this happening? How can successfully use htmlParse with this webpage?

EDIT:

I'm getting familiar with httr package, and this works just fine:

content(GET("http://www.transfermarkt.es/liga-mx-apertura/startseite/wettbewerb/MEXA"))

回答1:


That webserver appears to return a 403 Forbidden error when your HTTP request does not include a user-agent string. RCurl by default does not pass a user-agent. You can set one with the useragent= parameter.

myurl<-"http://www.transfermarkt.es/liga-mx-apertura/startseite/wettbewerb/MEXA"
url.exists(myurl, useragent="curl/7.39.0 Rcurl/1.95.4.5")
# [1] TRUE
htmlTreeParse(getURL(myurl, useragent="curl/7.39.0 Rcurl/1.95.4.5"))

The httr package is a bit nicer than RCurl for making HTTP requests in my opinion (and it sets a user-agent string by default). Here's the corresponding code

library(httr)
GET(myurl)


来源:https://stackoverflow.com/questions/29130237/rcurl-url-exists-returns-false-when-url-does-exists

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!