Question
I have an issue web scraping this website. If I try the "conventional" way, it works fine, as in the code below:
library(rvest)

base_url <- "https://www.ecb.europa.eu"
year_urls1 <- paste0(base_url, "/press/pressconf/", 2000:2008, "/html/index_include.en.html")

scrape_page <- function(url) {
  Sys.sleep(runif(1))  # random short pause between requests, to be polite to the server
  html_attr(html_nodes(read_html(url), ".doc-title a"), name = "href")
}

all_pages1 <- lapply(year_urls1, scrape_page)
all_pages1 <- paste0(base_url, unlist(all_pages1))
But now let's assume, for whatever reason, that read_html doesn't work on the URL directly. To circumvent the problem I wrap the URL in url(), but this then seems to create problems with html_nodes. This problem is addressed here; however, I can't manage to get around it. See the code below for a single case:
html_nodes(read_html(url(year_urls1[1], "rb")), ".doc-title a") # I get an empty object
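For completeness, here is the kind of workaround I have been considering: reading the connection manually first and then parsing the resulting text with read_html. This is only a sketch (it assumes the page is served as plain UTF-8 HTML, and I am not sure it is idiomatic):

```r
library(rvest)

# Open the connection ourselves, read the raw HTML as text, and close it
# explicitly before parsing -- so read_html only ever sees a character string.
con <- url(year_urls1[1])
raw_html <- paste(readLines(con, warn = FALSE), collapse = "\n")
close(con)

html_attr(html_nodes(read_html(raw_html), ".doc-title a"), name = "href")
```

But even with this I am not confident the extracted links are complete, so a cleaner approach would be very welcome.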
Is there anyone who can help me with this? You would make my day!
Thanks a lot!
Source: https://stackoverflow.com/questions/63158332/how-to-make-html-node-work-for-this-website