rvest: Return NAs for empty nodes given multiple listings

Question

I am fairly new to R (and using it for web scraping in particular), so any help is greatly appreciated. I am currently trying to mine a webpage that contains multiple ticket listings and lists additional details for some of these (like the ticket having an impaired view or being for children only). I want to extract this data, leaving blank spaces or NAs for the ticket listings that do not contain these details.

Since the original website requires the use of RSelenium, I have tried to replicate the HTML in a simpler form. If any information is missing, please let me know and I will try to provide it. Thanks!

So far, I have tried to adapt the solutions provided in "rvest missing nodes --> NA" and "htmlParse missing values NA", but I have not been able to replicate them for my example; I get the error message

Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"

I suspect I need a combination of rvest and lapply, but I cannot get it to work.

library(XML)
library(rvest)

html <- '<!DOCTYPE html>
<html>
<head>
</head>
<body>
<div class = "listing" id = "listing_1">
<em> 
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div class = "listing" id = "listing_2">
<em> 
<span class="listing_sub2">
Other text I am not interested in
</span>
</em>
</div>
<div class = "listing" id = "listing_3">
<div>
<em> 
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div>
<span class="listing_sub1">
Ticket for a child
</span>
</div>
</div>
</body>
</html>'


page_html <- read_html(html)
child <- html_nodes(page_html, xpath ="//*[@class='listing_sub1']") %>%
  html_text()
viewLim <- html_nodes(page_html, xpath ="//*[@class='listing_sub3']") %>%
  html_text()
id <- html_nodes(page_html, xpath = "//*[@class='listing']") %>%
  html_attr(name = "id")

I hope to obtain a table that looks similar to this:

listing  child   viewLim
1        F       T       
2        F       F      
3        T       T

Answer 1:

The strategy in this solution is to collect the listing nodes first and then search within each listing node for the desired information: the child-ticket and limited-view markers.

Using html_node instead of html_nodes always returns exactly one value per listing (even if it is just NA), which ensures that the resulting vectors all have the same length.
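
To make the length difference concrete, here is a minimal sketch on a made-up two-listing snippet (the markup, the ids "a" and "b", and the variable names are assumptions for illustration, not part of the original page):

```r
library(rvest)

# Hypothetical snippet: only the second listing contains the span of interest.
doc <- read_html('<div class="listing" id="a"><span class="other">x</span></div>
<div class="listing" id="b"><span class="listing_sub1">Ticket for a child</span></div>')

listings <- html_nodes(doc, "div.listing")

# html_nodes() returns only the matches, so listings without one are dropped.
all_matches <- html_nodes(listings, "span.listing_sub1")

# html_node() keeps one slot per listing and fills the gaps with a missing node.
one_per_listing <- html_node(listings, "span.listing_sub1")

length(all_matches)         # 1
length(one_per_listing)     # 2
html_text(one_per_listing)  # NA for the first listing, the span text for the second
```

Because the second result lines up with the listings one to one, it can be bound to the ids without any bookkeeping.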

Also, with rvest I prefer to use the CSS syntax instead of the xpath. In most cases the CSS is easier to use than the xpath expressions.
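
One case where the two syntaxes behave differently, sketched on a hypothetical element that carries a second class name ("extra" is an assumption, it does not appear in the question's HTML): the CSS class selector matches any element whose class list contains the name, while an XPath test of the form @class='...' compares against the whole attribute string.

```r
library(rvest)

# Hypothetical markup: the span has two classes instead of one.
doc <- read_html('<div class="listing"><span class="listing_sub1 extra">Ticket for a child</span></div>')

# CSS: matches because "listing_sub1" is one of the span's classes.
css_hits <- html_nodes(doc, "span.listing_sub1")

# XPath: no match, since the attribute value is "listing_sub1 extra", not "listing_sub1".
xpath_hits <- html_nodes(doc, xpath = "//span[@class='listing_sub1']")

length(css_hits)    # 1
length(xpath_hits)  # 0
```

An XPath expression can be written to handle this, but it is noticeably longer than the CSS form.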

library(rvest)

page_html <- read_html(html)
# find the listing nodes and the id of each node
listings <- html_nodes(page_html, "div.listing")
listing <- html_attr(listings, name = "id")

# search each listing node for the child-ticket and limited-view criteria
child <- sapply(listings, function(x) {html_node(x, "span.listing_sub1") %>% html_text()})
viewLim <- sapply(listings, function(x) {html_node(x, "span.listing_sub3") %>% html_text()})

# create the data frame
df <- data.frame(listing, child = !is.na(child), viewLim = !is.na(viewLim))

# df
#    listing child viewLim
#1 listing_1 FALSE    TRUE
#2 listing_2 FALSE   FALSE
#3 listing_3  TRUE    TRUE
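
As a possible variation, html_node also accepts the whole set of listing nodes at once and returns one result per listing, so the sapply calls can likely be dropped. The sketch below assumes a condensed copy of the question's HTML, and has_match is a hypothetical helper name introduced here, not an rvest function:

```r
library(rvest)

# Condensed copy of the example page (same classes and ids as in the question).
html <- '<div class="listing" id="listing_1"><em><span class="listing_sub3">Limited view</span></em></div>
<div class="listing" id="listing_2"><em><span class="listing_sub2">Other text</span></em></div>
<div class="listing" id="listing_3">
  <div><em><span class="listing_sub3">Limited view</span></em></div>
  <div><span class="listing_sub1">Ticket for a child</span></div>
</div>'

listings <- html_nodes(read_html(html), "div.listing")

# Hypothetical helper: TRUE where a listing contains at least one match for `css`.
has_match <- function(nodes, css) !is.na(html_text(html_node(nodes, css)))

df <- data.frame(
  listing = html_attr(listings, "id"),
  child   = has_match(listings, "span.listing_sub1"),
  viewLim = has_match(listings, "span.listing_sub3")
)
df
```

This should give the same three rows as the table above; the sapply version remains a reasonable choice when each listing needs more involved processing.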

Source: https://stackoverflow.com/questions/54478175/rvest-return-nas-for-empty-nodes-given-multiple-listings

Tags

web-scraping

rvest