rvest table scraping including links

Submitted by 折月煮酒 on 2019-12-06 11:53:31

Try this. It uses readHTMLTable() from the XML package with a custom elFun that keeps each cell's link:

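library(XML)
library(httr)
url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
# Parse with the XML package so that readHTMLTable() receives an XML document
doc <- htmlParse(content(GET(url), as = "text"), asText = TRUE)
# Cell handler: return "URL | text" for a cell with a link, else the plain text
getHrefs <- function(node, encoding) {
  x <- xmlChildren(node)$a
  if (!is.null(x)) {
    paste0("http://", parseURI(url)$server, xmlGetAttr(x, "href"), " | ", xmlValue(x))
  } else {
    xmlValue(xmlChildren(node)$text)
  }
}
# which = 3 selects the third table on the page
tab <- readHTMLTable(doc, which = 3, elFun = getHrefs)
head(tab[, 1:4])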
# No. in\nseries No. in\nseason                                                                                              Title                                                               Directed by
# 1              1              1 http://en.wikipedia.org/wiki/Simpsons_Roasting_on_an_Open_Fire | Simpsons Roasting on an Open Fire http://en.wikipedia.org/wiki/David_Silverman_(animator) | David Silverman
# 2              2              2                                     http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius                                                           David Silverman
# 3              3              3                    http://en.wikipedia.org/wiki/Homer%27s_Odyssey_(The_Simpsons) | Homer's Odyssey                      http://en.wikipedia.org/wiki/Wes_Archer | Wes Archer
# 4              4              4       http://en.wikipedia.org/wiki/There%27s_No_Disgrace_Like_Home | There's No Disgrace Like Home                    http://en.wikipedia.org/wiki/Gregg_Vanzo | Gregg Vanzo
# 5              5              5                                   http://en.wikipedia.org/wiki/Bart_the_General | Bart the General                                                           David Silverman
# 6              6              6                                           http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa                                                                Wes Archer

Each link's URL is preserved and separated from its text by a pipe (|); cells without a link contain only their text. You can split the two apart with, for example, strsplit(as.character(tab[, 3]), split = " | ", fixed = TRUE).
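
The split described above could be sketched as follows. This is only an illustration: the `cells` vector is a hypothetical stand-in for `as.character(tab[, 3])`, filled with a few values copied from the output above so that it runs without a network connection. The `links` data frame and its column names are also assumptions. It further assumes that the link text itself never contains " | ".

```r
# Hypothetical sample cells copied from the output above;
# the real input would be as.character(tab[, 3]).
cells <- c(
  "http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius",
  "http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa",
  "David Silverman"  # a cell without a link holds only its text
)

# Split each cell on the literal " | " separator.
parts <- strsplit(cells, split = " | ", fixed = TRUE)

# Cells with a link yield two pieces (URL, text); the others yield one (text).
has_link <- lengths(parts) == 2

links <- data.frame(
  url  = ifelse(has_link, vapply(parts, `[`, character(1), 1), NA_character_),
  text = vapply(parts, function(p) p[length(p)], character(1)),
  stringsAsFactors = FALSE
)
print(links)
```

Cells without a link get NA in the `url` column, so the link and the visible text can be handled separately afterwards.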
