RCurl getURL with loop - link to a PDF kills looping

Posted by 风格不统一 on 2019-12-23 02:21:32

Question


I've been puzzling over this for long enough now and can't figure out how to get around it. It is easiest to show with working dummy code:

require(RCurl)
require(XML)

#set a bunch of options for curl
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Firefox/23.0" 
curl = getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt' ,
  useragent = agent,
  followlocation = TRUE ,
  autoreferer = TRUE ,
  httpauth = 1L, # "basic" http authorization version -- this seems to make a difference for India servers
  curl = curl
)


list1 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')

#note list2 has a new link inserted in 2nd position; this is the link that kills the following getURL calls
list2 <- c('http://timesofindia.indiatimes.com//articleshow/2933112.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
           'http://timesofindia.indiatimes.com//articleshow/2933277.cms')



for ( i in seq_along( list1 ) ){
  print(list1[i])
  html <-
    try( getURL(
      list1[i],
      maxredirs = as.integer(20),
      followlocation = TRUE,
      curl = curl
    ),TRUE)
  if (inherits(html, "try-error")) {
    print(paste("error accessing",list1[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}


gc()

for ( i in seq_along( list2 ) ){
  print(list2[i])
  html <-
    try( getURL(
      list2[i],
      maxredirs = as.integer(20),
      followlocation = TRUE,
      curl = curl
    ),TRUE)
  if (inherits(html, "try-error")) {
    print(paste("error accessing",list2[i]))
    rm(html)
    gc()
    next
  } else {
    print('success')
  }
}

This should run with the RCurl and XML packages installed. The point is that when I insert http://timesofindia.indiatimes.com//articleshow/2933019.cms into the second position of the list, every subsequent request in the loop fails, even though the other links are unchanged. This happens consistently, here and in other cases, whenever the link points to a PDF (open the link to check).

Any thoughts on how to fix this so that fetching a link to a PDF doesn't kill my loop? As you can see, I have tried removing the potentially offending object, calling gc() all over the place, etc., but I can't figure out why a PDF kills my loop.

Thanks!

Just to check, here is my output for the two for loops:

    #[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
    #[1] "success"

and

    #[1] "http://timesofindia.indiatimes.com//articleshow/2933112.cms"
    #[1] "success"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933019.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933019.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933131.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933209.cms"
    #[1] "http://timesofindia.indiatimes.com//articleshow/2933277.cms"
    #[1] "error accessing http://timesofindia.indiatimes.com//articleshow/2933277.cms"

Answer 1:


You might find it easier to use httr. It wraps RCurl (httr 1.0.0 and later are built on the curl package instead) and sets the options you need by default. Here's the equivalent code with httr:

require(httr)

urls <- c(
  'http://timesofindia.indiatimes.com//articleshow/2933112.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933019.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933131.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933209.cms',
  'http://timesofindia.indiatimes.com//articleshow/2933277.cms'
)

responses <- lapply(urls, GET)
sapply(responses, http_status)

sapply(responses, function(x) headers(x)$`content-type`)
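
If a single failed request should not stop the rest of the run, one option is to wrap each request in tryCatch and keep only responses whose content type is HTML. This is a minimal sketch, not a confirmed fix for the RCurl behaviour in the question: is_html_type and fetch_html are hypothetical helper names introduced here, and skipping non-HTML bodies such as PDFs is an assumption about what the loop should do.

```r
# Hypothetical helper: TRUE when a Content-Type header value denotes HTML.
# A missing header (NULL or NA) counts as "not HTML".
is_html_type <- function(content_type) {
  if (is.null(content_type) || length(content_type) != 1 || is.na(content_type)) {
    return(FALSE)
  }
  grepl("^text/html", tolower(trimws(content_type)))
}

# Hypothetical helper: fetch one URL and return the page text, or NULL when
# the request fails, returns an HTTP error status, or is not HTML (e.g. a PDF).
fetch_html <- function(url) {
  # Trap connection-level errors so one bad URL does not abort the caller.
  resp <- tryCatch(httr::GET(url), error = function(e) NULL)
  if (is.null(resp) || httr::http_error(resp)) {
    return(NULL)
  }
  if (!is_html_type(httr::headers(resp)[["content-type"]])) {
    return(NULL)
  }
  httr::content(resp, as = "text")
}
```

With the urls vector above, pages <- lapply(urls, fetch_html) would give one entry per URL, with NULL for the ones that were skipped, and Filter(Negate(is.null), pages) would drop those entries.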


Source: https://stackoverflow.com/questions/25466013/rcurl-geturl-with-loop-link-to-a-pdf-kills-looping
