rvest table scraping including links

Submitted by 元气小坏坏 on 2019-12-10 11:33:38

Question


I would like to scrape some table data from Wikipedia. Some of the table columns contain links to other articles, and I'd like to preserve them. I tried the approach below, but it did not keep the URLs. Looking at the documentation for html_table(), I didn't find an option for including them. Is there another package or another way to do this?

library("rvest")

url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"

simp <- url %>%
        read_html() %>%  # html() is deprecated in newer rvest releases
        html_nodes(xpath='//*[@id="mw-content-text"]/table[3]') %>%
        html_table()

simp <- simp[[1]]

Answer 1:


Try this, which uses readHTMLTable() from the XML package with a custom function applied to each cell:

library(XML)
library(httr)
url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
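# Parse the page with the XML package so readHTMLTable() gets a document type it accepts
# (newer httr versions parse HTML with xml2 when content() is called without 'as')
doc <- htmlParse(content(GET(url), as = "text"), asText = TRUE)
# Cell handler: "URL | text" for a cell with a direct <a> child, plain text otherwise
getHrefs <- function(node, encoding) {
  x <- xmlChildren(node)$a
  if (!is.null(x)) {
    paste0("http://", parseURI(url)$server, xmlGetAttr(x, "href"), " | ", xmlValue(x))
  } else {
    xmlValue(xmlChildren(node)$text)
  }
}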
tab <- readHTMLTable(doc, which = 3, elFun = getHrefs)
head(tab[, 1:4])
# No. in\nseries No. in\nseason                                                                                              Title                                                               Directed by
# 1              1              1 http://en.wikipedia.org/wiki/Simpsons_Roasting_on_an_Open_Fire | Simpsons Roasting on an Open Fire http://en.wikipedia.org/wiki/David_Silverman_(animator) | David Silverman
# 2              2              2                                     http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius                                                           David Silverman
# 3              3              3                    http://en.wikipedia.org/wiki/Homer%27s_Odyssey_(The_Simpsons) | Homer's Odyssey                      http://en.wikipedia.org/wiki/Wes_Archer | Wes Archer
# 4              4              4       http://en.wikipedia.org/wiki/There%27s_No_Disgrace_Like_Home | There's No Disgrace Like Home                    http://en.wikipedia.org/wiki/Gregg_Vanzo | Gregg Vanzo
# 5              5              5                                   http://en.wikipedia.org/wiki/Bart_the_General | Bart the General                                                           David Silverman
# 6              6              6                                           http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa                                                                Wes Archer

The URLs are preserved and separated from the link text by a pipe (|). You can then split each cell into its URL and its text, for example with strsplit(as.character(tab[, 3]), split = " | ", fixed = TRUE).
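
As a rough sketch of that splitting step, the snippet below turns a column of "URL | text" strings into separate url and text columns. The `title` vector is a hypothetical stand-in for as.character(tab[, 3]), with values copied from the output above, and the names `parts`, `has_link` and `links` are only illustrative. It assumes the link text itself never contains " | ".

```r
# Hypothetical sample standing in for as.character(tab[, 3])
title <- c(
  "http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius",
  "http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa",
  "David Silverman"  # a cell without a link has no separator
)

# fixed = TRUE treats " | " literally instead of as a regular expression
parts <- strsplit(title, split = " | ", fixed = TRUE)

# Cells with a link give two pieces (URL, text); cells without give one (text)
has_link <- lengths(parts) == 2

links <- data.frame(
  url  = ifelse(has_link, vapply(parts, `[`, character(1), 1), NA_character_),
  text = vapply(parts, function(p) p[length(p)], character(1)),
  stringsAsFactors = FALSE
)
```

Cells that had no link end up with NA in the url column, so the text column stays complete for every row.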



Source: https://stackoverflow.com/questions/31924546/rvest-table-scraping-including-links
