How to make “html_node” work for this website?

走远了吗 · Submitted on 2020-08-10 03:38:12

Question


I have an issue web-scraping this website. If I try the "conventional" way, it works fine, as in the code below:


base_url <- "https://www.ecb.europa.eu"
year_urls1 <- paste0(base_url, "/press/pressconf/", 2000:2008, "/html/index_include.en.html")

scrape_page <- function(url) {
  Sys.sleep(runif(1))  # short random pause between requests

  # pull the href attribute from every link under ".doc-title"
  html_attr(html_nodes(read_html(url), ".doc-title a"), name = "href")
}

all_pages1 <- lapply(year_urls1, scrape_page)
all_pages1 <- paste0(base_url, unlist(all_pages1))

But now let's assume, for whatever reason, that read_html doesn't work on the URL directly. To get around this I use url(), but that in turn seems to create problems with html_nodes. This problem is addressed here; however, I haven't managed to resolve it. See the code below for a single case:

html_nodes(read_html(url(year_urls1[1], "rb")), ".doc-title a") # I get an empty object
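One workaround I have tried is to read the text from the connection myself and then hand that string to read_html(), rather than passing the connection object directly. This is only a sketch (it assumes rvest/xml2 are installed and that year_urls1 is defined as above, and the helper name read_html_via_connection is mine):

```r
library(rvest)

# Sketch: open the URL as a connection, read its contents as text,
# and parse the resulting string with read_html() instead of the
# connection object itself.
read_html_via_connection <- function(u) {
  con <- url(u, "rb")
  on.exit(close(con))                 # always release the connection
  lines <- readLines(con, warn = FALSE)
  read_html(paste(lines, collapse = "\n"))
}

links <- html_attr(
  html_nodes(read_html_via_connection(year_urls1[1]), ".doc-title a"),
  name = "href"
)
```

This still returns an empty object for me on this site, which is the behaviour I am asking about.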

Is there anyone who can help me with this? You would make my day!

Thanks a lot!

Source: https://stackoverflow.com/questions/63158332/how-to-make-html-node-work-for-this-website
