Question
I'm trying to search for nodes in an HTML document using rvest in R. In the code below, I would like to know how to return NULL or NA when "s_BadgeTop*" is missing. This is for academic purposes only.
<div style="margin-bottom:0.5em;"><div><div style="float:left;">Por </div><div style="float:left;"><a href="/gp/pdp/profile/XXX" ><span style = "font-weight: bold;">JOHN</span></a> (UK) - <a href="/gp/cdp/member-reviews/XXX">Ver todas las opiniones</a><br /><span class="cmtySprite s_BadgeTop1000 " ><span>(TOP 1000 COMENTARISTAS)</span></span></div></div></div>
<div style="margin-bottom:0.5em;"><div><div style="float:left;">Por </div><div style="float:left;"><a href="/gp/pdp/profile/YYY" ><span style = "font-weight: bold;">MARY</span></a> (USA) - <a href="/gp/cdp/member-reviews/YYY">Ver todas las opiniones</a><br /></div></div></div>
<div style="margin-bottom:0.5em;"><div><div style="float:left;">Por </div><div style="float:left;"><a href="/gp/pdp/profile/ZZZ" ><span style = "font-weight: bold;">CANDICE</span></a> (UK) - <a href="/gp/cdp/member-reviews/ZZZ">Ver todas las opiniones</a><br /><span class="cmtySprite s_BadgeTop500 " ><span>(TOP 500 COMENTARISTAS)</span></span></div></div></div>
I need a data.frame with this structure:
- JOHN (TOP 1000 COMENTARISTAS)
- MARY NA
- CANDICE (TOP 500 COMENTARISTAS)
I have tried this code:
name <- pg %>%
  html_nodes(xpath='//a[contains(@href,"/gp/pdp/profile/")]') %>%
  html_text
status <- pg %>%
  html_nodes(xpath='//span[contains(@class,"cmtySprite s_BadgeTop")]') %>%
  html_text
status[is.na(status)] <- "NA"
but status[is.na(status)] <- "NA" does not work.
I get this output:
- JOHN (TOP 1000 COMENTARISTAS)
- MARY (TOP 500 COMENTARISTAS)
- CANDICE (TOP 1000 COMENTARISTAS)
Thanks!
Answer 1:
You can iterate over each of the three entries, extract the name and (if present) the badge from each one, and finally merge all the results.
Example:
# For the pipe and the html_* functions
library(rvest)
# For rbindlist
library(data.table)

# Parse a single 'div' and extract the name and (if present) the badge
parse_node <- function(node) {
  name <- node %>%
    html_node('a[href^="/gp/pdp/profile"]') %>%
    html_text
  badge <- node %>%
    html_nodes('span[class*="s_BadgeTop"] span') %>%
    html_text
  # badge is character(0) when there is no badge, so badge[1] yields NA
  list(name = name[1], badge = badge[1])
}

# Extract one node per reviewer, parse each, and merge into one table
pg %>%
  html_nodes('div[style^="margin-bottom"] div div[style^=float]:nth-child(2)') %>%
  lapply(parse_node) %>%
  rbindlist
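A possible shorter variant, offered as a sketch rather than a definitive fix: html_nodes() returns only the nodes that actually match, so in the question status ends up with fewer elements than name and the two vectors no longer line up. html_node() (singular), when applied to a set of nodes, returns one result per input node, and html_text() gives NA where nothing matched. The html string below is a trimmed-down copy of the markup from the question, used here only as a stand-in for pg; the variable names entries and result are made up for illustration.

```r
library(rvest)

# Trimmed-down stand-in for the page in the question (an assumption, not the real page)
html <- paste0(
  '<div style="margin-bottom:0.5em;"><div><div style="float:left;">Por </div>',
  '<div style="float:left;"><a href="/gp/pdp/profile/XXX"><span>JOHN</span></a> (UK)<br />',
  '<span class="cmtySprite s_BadgeTop1000 "><span>(TOP 1000 COMENTARISTAS)</span></span>',
  '</div></div></div>',
  '<div style="margin-bottom:0.5em;"><div><div style="float:left;">Por </div>',
  '<div style="float:left;"><a href="/gp/pdp/profile/YYY"><span>MARY</span></a> (USA)<br />',
  '</div></div></div>',
  '<div style="margin-bottom:0.5em;"><div><div style="float:left;">Por </div>',
  '<div style="float:left;"><a href="/gp/pdp/profile/ZZZ"><span>CANDICE</span></a> (UK)<br />',
  '<span class="cmtySprite s_BadgeTop500 "><span>(TOP 500 COMENTARISTAS)</span></span>',
  '</div></div></div>'
)
pg <- read_html(html)

# One node per reviewer: the second floated div inside each entry
entries <- pg %>%
  html_nodes('div[style^="margin-bottom"] div div[style^="float"]:nth-child(2)')

# html_node() keeps one slot per entry, so a missing match becomes NA
name <- entries %>%
  html_node('a[href^="/gp/pdp/profile"]') %>%
  html_text()
badge <- entries %>%
  html_node('span[class*="s_BadgeTop"] span') %>%
  html_text()

result <- data.frame(name = name, badge = badge, stringsAsFactors = FALSE)
print(result)
```

Because both vectors come from the same entries, they always have the same length, and no lapply or rbindlist step is needed.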
Source: https://stackoverflow.com/questions/29877451/rvest-missing-nodes-na