rvest table scraping including links

Question

I would like to scrape table data from Wikipedia. Some table columns contain links to other articles, and I'd like to preserve those links. The approach below extracts the table but drops the URLs. The documentation for html_table() does not list an option for including them. Is there another package or another way to do this?

library("rvest")

url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"

simp <- url %>%
        read_html() %>%
        html_nodes(xpath='//*[@id="mw-content-text"]/table[3]') %>%
        html_table()

simp <- simp[[1]]

Answer 1:

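Try this, which uses readHTMLTable() from the XML package with a custom element function (elFun) to keep each link's href alongside its text: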

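library(XML)
library(httr)
url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
# Fetch the page as text and parse it with the XML package, since
# readHTMLTable() expects an XML-package document (or raw HTML text)
doc <- htmlParse(content(GET(url), as = "text", encoding = "UTF-8"), asText = TRUE)
# Called once per table cell: if the cell has a direct <a> child,
# return "absolute URL | link text"; otherwise return the cell's plain text
getHrefs <- function(node, encoding) {
  x <- xmlChildren(node)$a
  if (!is.null(x)) {
    paste0("http://", parseURI(url)$server, xmlGetAttr(x, "href"), " | ", xmlValue(x))
  } else {
    xmlValue(xmlChildren(node)$text)
  }
}
# which = 3 selects the third table on the page
tab <- readHTMLTable(doc, which = 3, elFun = getHrefs)
head(tab[, 1:4])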
# No. in\nseries No. in\nseason                                                                                              Title                                                               Directed by
# 1              1              1 http://en.wikipedia.org/wiki/Simpsons_Roasting_on_an_Open_Fire | Simpsons Roasting on an Open Fire http://en.wikipedia.org/wiki/David_Silverman_(animator) | David Silverman
# 2              2              2                                     http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius                                                           David Silverman
# 3              3              3                    http://en.wikipedia.org/wiki/Homer%27s_Odyssey_(The_Simpsons) | Homer's Odyssey                      http://en.wikipedia.org/wiki/Wes_Archer | Wes Archer
# 4              4              4       http://en.wikipedia.org/wiki/There%27s_No_Disgrace_Like_Home | There's No Disgrace Like Home                    http://en.wikipedia.org/wiki/Gregg_Vanzo | Gregg Vanzo
# 5              5              5                                   http://en.wikipedia.org/wiki/Bart_the_General | Bart the General                                                           David Silverman
# 6              6              6                                           http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa                                                                Wes Archer

In cells that contain a link, the URL is preserved and comes first, separated from the link text by a pipe (|). You can split the two apart with, for example, strsplit(as.character(tab[, 3]), split = " | ", fixed = TRUE).
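The split can be sketched on a few sample cells without hitting the network. This is only an illustration: the vector `cells` and the data frame `links` are hypothetical names, and the sketch assumes that a link text never itself contains " | ". Cells without a link (such as a plain director name) get NA as their URL.

```r
# Sample cells in the "URL | text" format shown above; the last one has no link.
cells <- c(
  "http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius",
  "http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa",
  "David Silverman"
)

# Split each cell on the literal " | " separator (fixed = TRUE disables regex).
parts <- strsplit(cells, split = " | ", fixed = TRUE)

# Cells with a link split into two pieces; plain cells stay as one piece.
has_link <- lengths(parts) == 2

# Hypothetical result layout: one row per cell, URL and text in separate columns.
links <- data.frame(
  url  = ifelse(has_link, vapply(parts, `[`, character(1), 1), NA_character_),
  text = vapply(parts, function(p) p[length(p)], character(1)),
  stringsAsFactors = FALSE
)

print(links)
```

On the real table, as.character(tab[, 3]) would take the place of `cells`.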

Source: https://stackoverflow.com/questions/31924546/rvest-table-scraping-including-links

Tags

web-scraping

rvest