Extracting href attr or converting node to character list

Submitted by £可爱£侵袭症+ on 2019-12-24 12:18:27

Question


I am trying to extract some information from a website:

library(rvest)
library(XML)
url <- "http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc"
# html() is the reader from early rvest releases; later releases replaced it with read_html()
html <- html(url)

nodes <- html_nodes(html, ".listItemSolr") 
nodes

I get a "list" of 30 fragments of HTML code. From each element of the "list" I want to extract the last href attribute, so for the 30th element it would be

<a href="http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq" title="W sobotę prezentacja hasła i programu wyborczego Komorowskiego">

so I want to get the string

"http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq"

The problem is that html_attr(nodes, "href") doesn't work (I get a vector of NAs). So I thought about using a regex, but nodes isn't a character list:

class(nodes)
[1] "XMLNodeSet"

I tried

xmlToList(nodes)

but it doesn't work either.

So my question is: how can I extract this URL with a function designed for HTML? Or, if that is not possible, how can I convert an XMLNodeSet to a character list?


Answer 1:


Try searching inside nodes' children:

nodes <- html_nodes(html, ".listItemSolr") 

sapply(html_children(nodes), function(x){
  html_attr( x$a, "href")
})

Update

Hadley suggested using elegant pipes:

html %>%  
  html_nodes(".listItemSolr") %>% 
  html_nodes(xpath = "./a") %>% 
  html_attr("href")
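
The behaviour described in the question can be reproduced without the live page. The following is a minimal sketch, assuming a current rvest release (where read_html() replaces html()) and a made-up HTML snippet with example.com URLs standing in for the real markup. It shows why html_attr() returns NA on the containers themselves, and uses ./a[last()], a small variation on the XPath above, to keep only the last link in each item.

```r
library(rvest)

# Hypothetical markup that mimics the page: each .listItemSolr
# container holds two links, and only the last one is wanted.
page <- read_html('<html><body>
  <div class="listItemSolr">
    <a href="http://example.com/img/1">image</a>
    <a href="http://example.com/story/1" title="First">First</a>
  </div>
  <div class="listItemSolr">
    <a href="http://example.com/img/2">image</a>
    <a href="http://example.com/story/2" title="Second">Second</a>
  </div>
</body></html>')

items <- html_nodes(page, ".listItemSolr")

# The containers are <div> elements with no href of their own,
# so asking them for "href" yields NA for every item.
container_hrefs <- html_attr(items, "href")

# Select the last direct <a> child of each container, then read its href.
last_links <- items %>%
  html_nodes(xpath = "./a[last()]") %>%
  html_attr("href")

print(container_hrefs)
print(last_links)
```

Dropping the [last()] predicate returns every direct <a> child, which is what the pipeline above does.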



Answer 2:


The XML package's getHTMLLinks() function can do virtually all the work for us; we just have to write the XPath query. Here we query all node attributes to find any that contain "listItemSolr", then step up to the parent node and query the href of its <a> children.

getHTMLLinks(url, xpQuery = "//@*[contains(., 'listItemSolr')]/../a/@href")

In xpQuery we are doing the following:

  • //@*[contains(., 'listItemSolr')] query all node attributes for listItemSolr
  • /.. select the parent node
  • /a/@href get the href links
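
The three steps above can be tried offline. This is a rough sketch, assuming the XML package is installed; the inline HTML and the example.com URLs are invented for illustration, and xpathSApply() is used here instead of getHTMLLinks() so that the query runs against an already parsed string.

```r
library(XML)

# Hypothetical markup standing in for the real page.
doc <- htmlParse('<html><body>
  <div class="listItemSolr"><a href="http://example.com/story/1">First</a></div>
  <div class="other"><a href="http://example.com/skip">Skip</a></div>
  <div class="listItemSolr"><a href="http://example.com/story/2">Second</a></div>
</body></html>', asText = TRUE)

# Same query as in the answer: attribute match, parent element, child <a>, href.
query <- "//@*[contains(., 'listItemSolr')]/../a/@href"

# Attribute nodes come back as (named) strings; as.character() drops the names.
links <- as.character(xpathSApply(doc, query))
print(links)
```

Note that contains(., 'listItemSolr') matches any attribute whose value includes that text, not only class, so on a real page it may be worth checking that nothing unintended is picked up.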


Source: https://stackoverflow.com/questions/29042027/extracting-href-attr-or-converting-node-to-character-list
