Question
I am trying to extract some information from a website:
library(rvest)
library(XML)
url <- "http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc"
html <- html(url)
nodes <- html_nodes(html, ".listItemSolr")
nodes
I get a "list" of 30 pieces of HTML code. From each element of the "list" I want to extract the last href attribute, so for the 30th element it would be
<a href="http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq" title="W sobotę prezentacja hasła i programu wyborczego Komorowskiego">
so I want to get the string
"http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq"
The problem is that html_attr(nodes, "href") doesn't work (I get a vector of NAs). So I thought about using a regex, but the problem is that nodes isn't a character list.
class(nodes)
[1] "XMLNodeSet"
I tried
xmlToList(nodes)
but it doesn't work either.
So my question is: how can I extract this URL with a function designed for HTML? Or, if that is not possible, how can I convert an XMLNodeSet to a character list?
Answer 1:
Try searching inside nodes' children:
nodes <- html_nodes(html, ".listItemSolr")
sapply(html_children(nodes), function(x) {
  html_attr(x$a, "href")
})
Update
Hadley suggested using elegant pipes:
html %>%
html_nodes(".listItemSolr") %>%
html_nodes(xpath = "./a") %>%
html_attr("href")
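The pipe version can be tried without network access on a small inline document. This is only a sketch: the markup below is a hypothetical stand-in for the real page (whose structure may differ), the example.com URLs are placeholders, and read_html() is used on the assumption of a recent rvest release. It also suggests why html_attr(nodes, "href") returned NAs: the href sits on a child a element, not on the .listItemSolr element itself.

```r
library(rvest)

# Hypothetical markup; the real page may be structured differently.
page <- read_html('<html><body>
<div class="listItemSolr">
  <span>12:00</span>
  <a href="http://example.com/first" title="First">First</a>
</div>
<div class="listItemSolr">
  <span>13:00</span>
  <a href="http://example.com/second" title="Second">Second</a>
</div>
</body></html>')

items <- html_nodes(page, ".listItemSolr")

# The matched <div> elements carry no href of their own, so this is all NA.
on_items <- html_attr(items, "href")

# Step down to the direct <a> children, then read their href attribute.
links <- page %>%
  html_nodes(".listItemSolr") %>%
  html_nodes(xpath = "./a") %>%
  html_attr("href")

print(on_items)
print(links)
```

Note that "./a" only matches a elements that are direct children of each item; if the link is nested deeper in the real page, ".//a" would be the broader choice.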
Answer 2:
The getHTMLLinks() function from the XML package can do virtually all the work for us; we just have to write the XPath query. Here we query all the node attributes to determine whether any contains "listItemSolr", then select the parent node for the href query.
getHTMLLinks(url, xpQuery = "//@*[contains(., 'listItemSolr')]/../a/@href")
In xpQuery we are doing the following:
//@*[contains(., 'listItemSolr')] : query all node attributes for listItemSolr
/.. : select the parent node
/a/@href : get the href links
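To see the three parts of the query working together, the same XPath can be run against a small parsed document instead of the live URL. This is a sketch under assumptions: the markup and the example.com URLs are made up for illustration, and the document is parsed first with htmlParse(asText = TRUE) so that nothing is fetched.

```r
library(XML)

# Hypothetical markup standing in for the real page.
snippet <- '<html><body>
<div class="listItemSolr"><a href="http://example.com/first">First</a></div>
<div class="other"><a href="http://example.com/skipped">Skipped</a></div>
<div class="listItemSolr"><a href="http://example.com/second">Second</a></div>
</body></html>'

# Parse the string as HTML content rather than as a file name or URL.
doc <- htmlParse(snippet, asText = TRUE)

# Attribute containing 'listItemSolr' -> its owning element -> child <a> -> href
query <- "//@*[contains(., 'listItemSolr')]/../a/@href"
links <- getHTMLLinks(doc, xpQuery = query)

print(links)
```

Because //@* matches any attribute, not only class, this query is somewhat broader than the CSS selector ".listItemSolr"; restricting it to //*[contains(@class, 'listItemSolr')]/a/@href would be a narrower alternative.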
Source: https://stackoverflow.com/questions/29042027/extracting-href-attr-or-converting-node-to-character-list