Question
I am fairly new to R (and using it for web scraping in particular), so any help is greatly appreciated. I am currently trying to mine a webpage that contains multiple ticket listings and lists additional details for some of these (like the ticket having an impaired view or being for children only). I want to extract this data, leaving blank spaces or NAs for the ticket listings that do not contain these details.
Since the original website requires the use of RSelenium, I have tried to replicate the HTML in a simpler form. If any information is missing, please let me know and I will try to provide it. Thanks!
So far, I have tried to adapt the solutions provided in "rvest missing nodes --> NA" and "htmlParse missing values NA", but I am not able to replicate them for my example, as I obtain the error message:
Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"
I guess I need a combination of rvest and lapply, but I cannot seem to make it work.
library(XML)
library(rvest)
html <- '<!DOCTYPE html>
<html>
<head>
</head>
<body>
<div class = "listing" id = "listing_1">
<em>
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div class = "listing" id = "listing_2">
<em>
<span class="listing_sub2">
Other text I am not interested in
</span>
</em>
</div>
<div class = "listing" id = "listing_3">
<div>
<em>
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div>
<span class="listing_sub1">
Ticket for a child
</span>
</div>
</div>
</body>
</html>'
page_html <- read_html(html)
child <- html_nodes(page_html, xpath = "//*[@class='listing_sub1']") %>%
  html_text()
viewLim <- html_nodes(page_html, xpath = "//*[@class='listing_sub3']") %>%
  html_text()
id <- html_nodes(page_html, xpath = "//*[@class='listing']") %>%
  html_attr(name = "id")
I hope to obtain a table that looks similar to this:
listing child viewLim
1 F T
2 F F
3 T T
Answer 1:
The strategy in this solution is to collect the node for each listing and then search within each of those nodes for the desired information: the child ticket and the limited view.
Using html_node instead of html_nodes always returns exactly one value per listing (even if it is just NA), which ensures that the resulting vectors all have the same length.
Also, with rvest I prefer the CSS selector syntax over XPath. In most cases, CSS selectors are easier to write and read than XPath expressions.
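To illustrate the length difference described above, here is a minimal sketch. The two-listing snippet and the variable names are made up for illustration and are not taken from the original page.

```r
library(rvest)

# Hypothetical two-listing snippet: only the first listing has a child ticket
snippet <- read_html(paste0(
  '<div class="listing" id="a"><span class="listing_sub1">Ticket for a child</span></div>',
  '<div class="listing" id="b"><span class="listing_sub2">Other text</span></div>'
))

items <- html_nodes(snippet, "div.listing")

# html_nodes() returns only the matches, so listings without one are dropped
all_matches <- html_text(html_nodes(items, "span.listing_sub1"))

# html_node() returns one result per listing and yields NA where nothing matches
per_listing <- html_text(html_node(items, "span.listing_sub1"))

length(all_matches)   # number of matches found
length(per_listing)   # number of listings
is.na(per_listing)    # which listings lack the detail
```

Because per_listing lines up with the listings one to one, it can be placed directly next to the listing ids in a data frame.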
library(rvest)
page_html <- read_html(html)
# find the listing nodes and the id of each node
listings <- html_nodes(page_html, "div.listing")
listing <- html_attr(listings, name = "id")
# search each listing node for the child ticket and limited view criteria
child <- sapply(listings, function(x) {html_node(x, "span.listing_sub1") %>% html_text()})
viewLim <- sapply(listings, function(x) {html_node(x, "span.listing_sub3") %>% html_text()})
# create the data frame; TRUE means the detail is present in that listing
df <- data.frame(listing, child = !is.na(child), viewLim = !is.na(viewLim))
# df
# listing child viewLim
#1 listing_1 FALSE TRUE
#2 listing_2 FALSE FALSE
#3 listing_3 TRUE TRUE
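If the text of each detail is wanted rather than a TRUE/FALSE flag, with NA for listings that lack it (as the question asks), the same idea can be sketched as follows. The helper name extract_or_na and the shortened sample HTML are assumptions made for this example. The sketch relies on html_node accepting the whole set of listing nodes at once.

```r
library(rvest)

# Shortened, hypothetical version of the sample page with two listings
sample_html <- paste0(
  '<div class="listing" id="listing_1"><em>',
  '<span class="listing_sub3"> Limited view </span></em></div>',
  '<div class="listing" id="listing_2"><em>',
  '<span class="listing_sub2"> Other text </span></em></div>'
)
nodes <- html_nodes(read_html(sample_html), "div.listing")

# Hypothetical helper: one trimmed text value per listing, NA where absent
extract_or_na <- function(nodes, css) {
  trimws(html_text(html_node(nodes, css)))
}

details <- data.frame(
  listing = html_attr(nodes, "id"),
  child   = extract_or_na(nodes, "span.listing_sub1"),
  viewLim = extract_or_na(nodes, "span.listing_sub3"),
  stringsAsFactors = FALSE
)
details
```

The logical table from the answer can still be derived from this with !is.na() on the child and viewLim columns.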
Source: https://stackoverflow.com/questions/54478175/rvest-return-nas-for-empty-nodes-given-multiple-listings