Question
I would like to extract the href values from the page below:
http://www.mitbbs.com/bbsdoc1/USANews_101_0.html
For each topic, I would like to get something like this:
/USANews/31587637.html
/USANews/31587633.html
/USANews/31587631.html
...
This is the code I am using, but it doesn't work:
library("XML")
library("httr")
library("stringr")
data <- list()
for (i in 101:201) {
  url <- paste('bbsdoc1/USANews_', i, '_0.html', sep='')
  html <- content(GET("http://www.mitbbs.com/", path = url), as = 'parsed')
  url.list <- xpathSApply(html, "//td[@align='left' height=26]/[@class='news1' href]", xmlAttrs)
  data <- rbind(data, url.list)
}
Your suggestions are really appreciated!
Answer 1:
Retrieve the document:
library(XML)
html = htmlParse("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html")
Then extract the links and text you're interested in using the appropriate XPath queries:
href = "//a[./@class='news1']/@href"
text = "//a[./@class='news1']/text()"
df = data.frame(
    url = sub("article_t/", "", sapply(html[href], as.character)),
    text = trimws(sapply(html[text], xmlValue)))
Note that trimws() is only available in R 3.2.0 and later.
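As for why the xpathSApply() call in the question fails: the XPath itself is not valid. Every attribute test inside a predicate needs an @, multiple tests must be joined with "and", and a step such as /[...] needs an element name before the bracket. Below is a minimal sketch of a corrected call. It runs against a small inline HTML string that I made up to stand in for the real page (the markup and the topic titles are assumptions, not copied from the site), so adjust the path if the actual layout differs.

```r
library(XML)

# Hypothetical markup standing in for the real board page
page <- '<html><body><table>
<tr><td align="left" height="26">
  <a class="news1" href="/article_t/USANews/31587637.html">First topic</a></td></tr>
<tr><td align="left" height="26">
  <a class="news1" href="/article_t/USANews/31587633.html">Second topic</a></td></tr>
</table></body></html>'

doc <- htmlParse(page, asText = TRUE)

# Valid predicate syntax: @ before each attribute, tests joined with "and",
# and an element name (a) in front of the second predicate
xp <- "//td[@align='left' and @height='26']//a[@class='news1']"

# xmlGetAttr() pulls a single attribute from each matched node
hrefs <- xpathSApply(doc, xp, xmlGetAttr, "href")

# Drop the "/article_t" prefix to get paths like /USANews/31587637.html
hrefs <- sub("/article_t", "", hrefs, fixed = TRUE)
print(hrefs)
```

Using xmlGetAttr() with "href" instead of xmlAttrs() returns only the attribute you want, rather than every attribute of each node.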
Answer 2:
You should look into the rvest package, which simplifies things a lot:
library(rvest); library(dplyr)
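# Select every element with class "news1" and read its href attribute
myList <- read_html("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html") %>%
  html_nodes(".news1") %>% html_attr("href")
myList
# Strip the "/article_t" prefix from each link
myList %>% gsub("/article_t", "", .)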
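To cover the whole range of pages from the question (101 to 201), the same idea can be wrapped in a few small functions. This is only a sketch: page_url(), topic_links() and collect_links() are names I made up for illustration, and the URL pattern is taken from the question.

```r
library(rvest)

# Build the URL of one board page (pattern taken from the question)
page_url <- function(i) {
  paste0("http://www.mitbbs.com/bbsdoc1/USANews_", i, "_0.html")
}

# Return the cleaned topic links found in one parsed page
topic_links <- function(doc) {
  hrefs <- html_attr(html_nodes(doc, ".news1"), "href")
  sub("/article_t", "", hrefs, fixed = TRUE)
}

# Fetch every page in `pages` and combine the links into one character vector
collect_links <- function(pages) {
  unlist(lapply(pages, function(i) topic_links(read_html(page_url(i)))))
}
```

Calling collect_links(101:201) would then request all 101 pages in turn. Building the result with lapply() and unlist() avoids growing an object with rbind() inside a loop, and it may be polite to add a Sys.sleep() between requests.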
Source: https://stackoverflow.com/questions/28814454/use-xpathsapply-in-r