How to use rvest for web crawling correctly?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-25 10:02:42

Question


I am trying to scrape the page http://www.funda.nl/en/koop/leiden/ to get the highest page number it shows, which is 29. Following an online tutorial, I located where 29 appears in the HTML and wrote this R code:

library(rvest)

url <- read_html("http://www.funda.nl/en/koop/leiden/")

url %>% html_nodes("#pagination-number.pagination-last") %>% html_attr("data-pagination-page") %>% as.numeric()

However, what I got is numeric(0). If I remove as.numeric(), I get character(0).

How can I get the number?


Answer 1:


I believe that both your CSS selector and your parsing of the HTML are wrong. To find a CSS selector easily, you can use a Chrome extension called SelectorGadget. In your case, the text of the selected node also needs some parsing, which the str_extract_all() call handles.

This will work:

library(rvest)

url <- read_html("http://www.funda.nl/en/koop/leiden/")

# Select the node by its class, then pull the page number out of its text
pagination.last <- url %>% 
  html_node(".pagination-last") %>%
  html_text() %>% 
  stringr::str_extract_all("[:number:]{1,2}", simplify = TRUE) %>%
  as.numeric()

> pagination.last
[1] 29
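
To see why the selector in the question returns nothing, here is a minimal offline sketch. The HTML fragment is invented for illustration (the real funda.nl markup may well differ), and the variable names no_match and last_page are hypothetical. It assumes only that the rvest and stringr packages are installed.

```r
library(rvest)

# Hypothetical markup, made up for this example; not copied from funda.nl.
html <- read_html('
  <div class="pagination-pages">
    <a data-pagination-page="1">Page 1</a>
    <a data-pagination-page="2">Page 2</a>
    <a class="pagination-last" data-pagination-page="29">Page 29</a>
  </div>')

# "#pagination-number.pagination-last" asks for a single element that has
# BOTH id="pagination-number" AND class="pagination-last". No element here
# has that id, so the node set is empty and html_attr() returns character(0).
no_match <- html %>%
  html_nodes("#pagination-number.pagination-last") %>%
  html_attr("data-pagination-page")

# Selecting by class alone finds the node; its text still contains the word
# "Page", so the digits are extracted before converting to a number.
last_page <- html %>%
  html_node(".pagination-last") %>%
  html_text() %>%
  stringr::str_extract("[0-9]+") %>%
  as.numeric()
```

On this fragment, no_match is empty, which mirrors the character(0) reported in the question, while last_page holds the number taken from the node's text. Whether the live page exposes an id or attribute like the ones above has to be checked in the browser's inspector or with SelectorGadget.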

You might find this other question helpful as well: R: Rvest - got hidden text i don't want




Answer 2:


I've been dealing with the same issue and this worked for me:

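> library(rvest)    # read_html(), html_nodes(), html_children(), html_text()
> library(dplyr)    # last()
> library(stringr)  # str_extract()
> url <- "http://www.funda.nl/en/koop/leiden/"
> last_page <-
+   last(read_html(url) %>% 
+          html_nodes(css = ".pagination-pages") %>%
+          html_children()) %>% 
+   html_text(trim = TRUE) %>% 
+   str_extract("[0-9]+") %>% 
+   as.numeric()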
> last_page
[1] 23


Source: https://stackoverflow.com/questions/44346556/how-to-use-rvest-to-web-crawling-correctly
