Question
I would like to scrape some table data from Wikipedia. Some of the table columns include links to other articles, which I'd like to preserve. I've tried the approach below, but it didn't preserve the URLs. Looking at the description of the html_table() function, I didn't find any option for including them. Is there another package or way to do this?
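library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
# read_html() replaces html(), which is deprecated in newer rvest versions
simp <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[3]') %>%
  html_table()
# html_table() returns a list of data frames; keep the first one
simp <- simp[[1]]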
Answer 1:
Try this:
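library(XML)
library(httr)
url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
# Parse with the XML package so readHTMLTable() accepts the document;
# recent httr versions return an xml2 object from content() by default.
doc <- htmlParse(content(GET(url), as = "text", encoding = "UTF-8"), asText = TRUE)
# Cell handler: return "URL | text" if the cell has a link child, else the plain text.
getHrefs <- function(node, encoding) {
  x <- xmlChildren(node)$a
  if (!is.null(x)) {
    paste0("http://", parseURI(url)$server, xmlGetAttr(x, "href"), " | ", xmlValue(x))
  } else {
    xmlValue(xmlChildren(node)$text)
  }
}
tab <- readHTMLTable(doc, which = 3, elFun = getHrefs)
head(tab[, 1:4])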
# No. in\nseries No. in\nseason Title Directed by
# 1 1 1 http://en.wikipedia.org/wiki/Simpsons_Roasting_on_an_Open_Fire | Simpsons Roasting on an Open Fire http://en.wikipedia.org/wiki/David_Silverman_(animator) | David Silverman
# 2 2 2 http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius David Silverman
# 3 3 3 http://en.wikipedia.org/wiki/Homer%27s_Odyssey_(The_Simpsons) | Homer's Odyssey http://en.wikipedia.org/wiki/Wes_Archer | Wes Archer
# 4 4 4 http://en.wikipedia.org/wiki/There%27s_No_Disgrace_Like_Home | There's No Disgrace Like Home http://en.wikipedia.org/wiki/Gregg_Vanzo | Gregg Vanzo
# 5 5 5 http://en.wikipedia.org/wiki/Bart_the_General | Bart the General David Silverman
# 6 6 6 http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa Wes Archer
The URLs are preserved and separated from the text by a pipe (|). So you could split them up, for example by using strsplit(as.character(tab[, 3]), split = " | ", fixed = TRUE).
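As a rough sketch of that splitting step, the snippet below turns "URL | text" cells into separate url and text columns. The cells vector is made-up sample data standing in for a column such as tab[, 3], and split_cell is a hypothetical helper name, not part of any package. It assumes the link text itself never contains " | "; a cell that splits into more than two pieces is treated like a cell without a link.

```r
# Hypothetical sample mimicking the "URL | text" cells produced above.
cells <- c(
  "http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius",
  "David Silverman"  # a cell without a link has no separator
)

# Split each cell on the literal " | " separator.
parts <- strsplit(cells, split = " | ", fixed = TRUE)

# Cells with a link yield two pieces (URL, text); others yield the text only.
split_cell <- function(p) {
  if (length(p) == 2) {
    c(url = p[1], text = p[2])
  } else {
    c(url = NA_character_, text = p[1])
  }
}

# Stack the per-cell results into a two-column data frame.
links <- as.data.frame(do.call(rbind, lapply(parts, split_cell)),
                       stringsAsFactors = FALSE)
links
```

With the real table, something like cells <- as.character(tab[, 3]) would replace the sample vector, and the resulting columns could be bound back onto tab with cbind().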
Source: https://stackoverflow.com/questions/31924546/rvest-table-scraping-including-links