Scrape single node excluding others in same category

Submitted by 六眼飞鱼酱① on 2019-12-13 03:35:13

Question


Building on an earlier question, I want to extract a single node (the "likes" count) from among the .smallText nodes while ignoring the others. The node I want is a.smallText, so I need to select only that one.

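Code:

library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_trim()
library(tibble)   # enframe()

url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"

quote_rating <- function(html){

  path <- read_html(html)

  # Select the text nodes directly under every element with class smallText
  path %>%
    html_nodes(xpath = paste(selectr::css_to_xpath(".smallText"), "/text()")) %>%
    html_text(trim = TRUE) %>%
    str_trim(side = "both") %>%
    enframe(name = NULL)
}

quote_rating(url)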

Which gives a result:

# A tibble: 80 x 1
   value              
   <chr>              
 1 Showing 1-20 of 790
 2 (0.03 seconds)     
 3 tags:              
 4 ""                 
 5 2492 likes         
 6 2265 likes         
 7 tags:              
 8 ,                  
 9 ,                  
10 ,                  
# ... with 70 more rows

Adding html_nodes("a.smallText") to the pipeline filters too much:

quote_rating <- function(html){

  path <- read_html(html) 

  path %>% 
    html_nodes(xpath = paste(selectr::css_to_xpath(".smallText"), "/text()")) %>%
    html_nodes("a.smallText") %>% 
    html_text(trim = TRUE) %>%
    str_trim(side = "both") %>% 
    enframe(name = NULL)

}

# A tibble: 0 x 1
# ... with 1 variable: value <chr>
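
The empty result is consistent with what the first step returns: an XPath ending in /text() yields text nodes, and a text node has no element descendants, so a CSS selector applied afterwards has nothing to match. A minimal sketch of this, run on a made-up HTML fragment (the markup is an assumption for illustration, not the real Goodreads page):

```r
library(rvest)

# Hypothetical fragment standing in for the real page
page <- read_html('<html><body>
  <span class="smallText">Showing 1-20 of 790</span>
  <div class="quote">
    <div class="smallText">tags: <a href="#">love</a></div>
    <a class="smallText" href="#">2492 likes</a>
  </div>
</body></html>')

# Text nodes directly under any element whose class contains smallText
text_nodes <- html_nodes(page, xpath = "//*[contains(@class, 'smallText')]/text()")

# Searching inside text nodes finds nothing: they have no child elements
inside_text <- html_nodes(text_nodes, "a.smallText")
length(inside_text)

# Selecting the element directly from the document does match
likes <- html_text(html_nodes(page, "a.smallText"), trim = TRUE)
likes
```

Dropping the /text() step and selecting a.smallText from the parsed document avoids the problem.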


Answer 1:


To extract the number of likes for each quote, you can do the filtering with a CSS selector alone: select the a tags that have class smallText.

This simple code fragment works:

library(rvest)
url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"

path <- read_html(url) 

path %>% 
    html_nodes("a.smallText") %>% 
    html_text(trim = TRUE)

# [1] "2492 likes" "2265 likes" "2168 likes" "2003 likes" "1774 likes" "1060 likes" "580 likes" 
# [8] "523 likes"  "482 likes"  "403 likes"  "383 likes"  "372 likes"  "360 likes"  "347 likes" 
# [15] "330 likes"  "329 likes"  "318 likes"  "317 likes"  "310 likes"  "281 likes" 
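
If the counts are needed as numbers rather than strings, one possible follow-up is to strip the non-digit characters and convert. This is a sketch in base R on a small sample of the output above; the variable names likes_text and likes_count are made up for illustration:

```r
# Sample of the character vector returned above
likes_text <- c("2492 likes", "2265 likes", "580 likes")

# Strip everything that is not a digit, then convert to integer
likes_count <- as.integer(gsub("[^0-9]", "", likes_text))
likes_count
```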



Answer 2:


This works for me...

library(rvest)
url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"
page <- read_html(url)
page %>% html_nodes("div.quote.mediumText") %>%   #select quote boxes
  html_node("a.smallText") %>%                    #then the smallText in each one
  html_text()

 [1] "2492 likes" "2265 likes" "2168 likes"
 [4] "2003 likes" "1774 likes" "1060 likes"
 [7] "580 likes"  "523 likes"  "482 likes" 
[10] "403 likes"  "383 likes"  "372 likes" 
[13] "360 likes"  "347 likes"  "330 likes" 
[16] "329 likes"  "318 likes"  "317 likes" 
[19] "310 likes"  "281 likes" 

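Note the distinction between html_node and html_nodes: html_node returns exactly one result for each input node (NA if there is no match), whereas html_nodes returns every match. The advantage of selecting the quote boxes first is that you can then extract other information from each box if you wish, and it will line up with the number of likes.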
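
To illustrate why the box-first approach keeps things aligned, here is a small sketch on a made-up fragment in which the second quote box has no likes link. The markup and the div.quoteText class are assumptions for illustration and have not been checked against the live page:

```r
library(rvest)

# Hypothetical fragment: the second quote box has no likes link
page <- read_html('<html><body>
  <div class="quote mediumText">
    <div class="quoteText">First quote</div>
    <a class="smallText" href="#">2492 likes</a>
  </div>
  <div class="quote mediumText">
    <div class="quoteText">Second quote</div>
  </div>
</body></html>')

boxes <- html_nodes(page, "div.quote.mediumText")

# html_node() returns one result per box, NA where nothing matches
likes_per_box <- html_text(html_node(boxes, "a.smallText"))

# html_nodes() returns only the matches, so the link to each box is lost
likes_all <- html_text(html_nodes(boxes, "a.smallText"))

# Because every column has one entry per box, they combine row by row
quotes <- data.frame(
  quote = html_text(html_node(boxes, "div.quoteText"), trim = TRUE),
  likes = likes_per_box,
  stringsAsFactors = FALSE
)
quotes
```

With html_node, a box without a likes link still gets a row (with NA), so the quote text and the likes count stay matched.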



Source: https://stackoverflow.com/questions/56496925/scrape-single-node-excluding-others-in-same-category
