Scrape first class node but not child using rvest

半腔热情 submitted on 2019-12-25 01:14:28

Question


There are many questions on this topic, but I couldn't find the answer I'm looking for.

I want to extract the text of elements with the class .quoteText. My code below does that, but it also extracts the text of every child node inside .quoteText:

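library(rvest)    # read_html(), html_nodes(), html_text(), and the %>% pipe
library(stringr)  # str_trim()

url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"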

quote_text <- function(html){

  path <- read_html(html)

  path %>% 
    html_nodes(".quoteText") %>%
    html_text(trim = TRUE) %>% 
    str_trim(side = "both") %>% 
    unlist()
}

quote_text(url)

The result contains the quote text, but also the text of every child node.

In the browser inspector, what I'm looking for is the text line directly inside the .quoteText element, not the nested elements beneath it.

There must be a way to scrape only that line, no? Or will I need to collect that line, and remove the rest with a str_extract / regex?


Answer 1:


CSS selectors have no way to select only the immediate text of a node, but XPath does. We can adjust your function to extract just that text:

quote_text <- function(html){

  path <- read_html(html)

  path %>% 
    html_nodes(xpath = paste0(selectr::css_to_xpath(".quoteText"), "/text()")) %>%
    html_text(trim = TRUE) %>% 
    str_trim(side = "both") %>% 
    unlist()
}

I convert the CSS selector to an XPath expression and then append "/text()", which selects only the text nodes that are direct children of the matched elements.
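
To see the effect without hitting the live site, here is a minimal sketch run against a small inline HTML fragment. The fragment is a made-up stand-in for one Goodreads quote block (its exact markup is an assumption, not copied from the page), and the variable names are only illustrative.

```r
library(rvest)

# Hypothetical fragment: the quote is direct text of .quoteText,
# while the author sits in a child <span>.
html <- '<div class="quoteText">
  One is not born, but rather becomes, a woman.
  <br>
  <span class="authorOrTitle">Simone de Beauvoir</span>
</div>'

page <- read_html(html)

# Build the XPath: the CSS class selector translated by selectr,
# followed by /text() to pick only direct text-node children.
xp <- paste0(selectr::css_to_xpath(".quoteText"), "/text()")

direct_text <- page %>%
  html_nodes(xpath = xp) %>%
  html_text(trim = TRUE)

# Whitespace-only text nodes (e.g. around <br>) trim down to "";
# drop them so only real text remains.
direct_text <- direct_text[direct_text != ""]

print(direct_text)
```

The text inside the child span is not returned because it is a text node of the span, not of the .quoteText element itself. Filtering out empty strings is a defensive step, since each stretch of whitespace between child elements can show up as its own text node.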



Source: https://stackoverflow.com/questions/56484967/scrape-first-class-node-but-not-child-using-rvest
