r rvest error: “Error in doc_namespaces(doc) : external pointer is not valid”

Posted by 删除回忆录丶 on 2020-04-16 05:42:12

Question


My question is similar to this one, but that one did not receive an answer I can work with. I am scraping thousands of URLs with xml2::read_html, which works fine. But when I try to parse the resulting HTML documents using purrr::map_df and html_nodes, I get the following error:

Error in doc_namespaces(doc) : external pointer is not valid

For some reason, I am unable to reproduce the error with a small example. The example below is not a good one, because it works totally fine. But if someone could explain to me conceptually what the error means and how to solve it, that would be great (here is a GitHub thread on a similar problem, but I don't follow all the technicalities).

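library(rvest)
library(purrr)
library(tibble)  # tibble() is not exported by rvest or purrr

urls_test <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
                  "https://en.wikipedia.org/wiki/Rome")

# Step 1: download every page, pausing 1-3 seconds between requests
h <- urls_test %>% map(~{
  Sys.sleep(sample(seq(1, 3, by = 0.001), 1))
  read_html(.x)
})

# Step 2: parse the stored documents into one data frame
# (map_df row-binds the results and needs dplyr installed)
out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  a <- if (length(a) == 0) NA else a
  b <- html_nodes(., ".toctext") %>% html_text()
  b <- if (length(b) == 0) NA else b

  tibble(a, b)
})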
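
A possible workaround, sketched under an unconfirmed assumption about the cause: the objects returned by read_html are external pointers to structures that xml2 keeps in memory, so they may stop being valid once they outlive the process that created them (for example after saveRDS/readRDS, after a restored workspace, or when returned from parallel workers). Whether that is what happens in my real job is a guess. The sketch below avoids holding on to document objects at all: it parses each page in the same iteration that reads it, and if the raw pages must be stored, it stores them as plain text. The inline HTML strings and the helper name extract_fields are hypothetical stand-ins, used so the example runs without network access.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical stand-ins for downloaded pages; in the real job each
# element would be a URL passed to read_html().
pages <- list(
  "<html><body><h1 id='firstHeading'>FC Barcelona</h1><span class='toctext'>History</span><span class='toctext'>Support</span></body></html>",
  "<html><body><h1 id='firstHeading'>Rome</h1></body></html>"
)

# Hypothetical helper: reduce one parsed document to a plain tibble,
# so nothing that wraps an external pointer is returned.
extract_fields <- function(doc) {
  a <- doc %>% html_nodes("#firstHeading") %>% html_text()
  b <- doc %>% html_nodes(".toctext") %>% html_text()
  tibble(
    a = if (length(a) == 0) NA_character_ else a,
    b = if (length(b) == 0) NA_character_ else b
  )
}

# Option 1: read and parse in the same iteration; only data frames survive.
out <- map_df(pages, ~ extract_fields(read_html(.x)))

# Option 2: if the raw pages have to be kept, store them as character
# strings (assumption: text survives saveRDS/readRDS, pointers may not).
page_text <- map_chr(pages, ~ as.character(read_html(.x)))
tmp <- tempfile(fileext = ".rds")
saveRDS(page_text, tmp)

# Re-parse the stored text into fresh documents before extracting.
restored <- map(readRDS(tmp), read_html)
out2 <- map_df(restored, extract_fields)
```

Both options should give the same data frame; the first is simpler, while the second keeps a copy of the raw HTML in case the selectors need to change later.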

Session info:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Devuan GNU/Linux ascii

Source: https://stackoverflow.com/questions/56261745/r-rvest-error-error-in-doc-namespacesdoc-external-pointer-is-not-valid
