Use xpathSApply in R

Submitted by 偶尔善良 on 2021-02-19 08:00:36

Question


I would like to extract the href values from the page below.

http://www.mitbbs.com/bbsdoc1/USANews_101_0.html

For each topic, I would like to get something like this:

/USANews/31587637.html

/USANews/31587633.html

/USANews/31587631.html

...

I used the code below, but it doesn't work.

library("XML")   
library("httr")
library("stringr")

data <- list()

for( i in 101:201){
url <- paste('bbsdoc1/USANews_', i, '_0.html', sep='')
html <- content(GET("http://www.mitbbs.com/", path = url),as = 'parsed')
url.list <- xpathSApply(html, "//td[@align='left' height=26]/[@class='news1' href]", xmlAttrs)
data <- rbind(data, url.list)

} 

Your suggestions are really appreciated!


Answer 1:


Retrieve the document

library(XML)
html = htmlParse("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html")

and extract the links and text you're interested in using the appropriate XPath queries

href = "//a[./@class='news1']/@href"
text = "//a[./@class='news1']/text()"
df = data.frame(
    url=sub("article_t/", "", sapply(html[href], as.character)),
    text=trimws(sapply(html[text], xmlValue)))

trimws() is available in R 3.2.0 and later.
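
For comparison, the XPath in the question fails for two reasons: conditions inside one predicate must be joined with "and" and every attribute needs an "@" prefix, and a step such as "/[...]" has no element name to match. A minimal sketch of a corrected xpathSApply call is below. The HTML snippet is a made-up stand-in for the page's markup (an assumption, not copied from the real site), so it runs without network access.

```r
library(XML)

# Hypothetical markup imitating the structure the question's XPath targets
snippet <- '<html><body><table><tr>
  <td align="left" height="26"><a class="news1" href="/article_t/USANews/31587637.html"> First topic </a></td>
  <td align="left" height="26"><a class="news1" href="/article_t/USANews/31587633.html"> Second topic </a></td>
  <td><a class="other" href="/ignored.html">Not a topic</a></td>
</tr></table></body></html>'

doc <- htmlParse(snippet, asText = TRUE)

# Select the <a class="news1"> elements inside the matching <td> cells,
# then read the href attribute of each node with xmlGetAttr
hrefs <- xpathSApply(
  doc,
  "//td[@align='left' and @height='26']/a[@class='news1']",
  xmlGetAttr, "href"
)

# Drop the "article_t/" segment, as in the answer above
hrefs <- sub("article_t/", "", hrefs)
print(hrefs)
```

Whether the real page still uses these td attributes is not confirmed here; the shorter query "//a[@class='news1']" from the answer does not depend on them.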




Answer 2:


You should look into the rvest package, which simplifies things a lot:

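library(rvest); library(dplyr)
# Select every element with class "news1" and read its href attribute
myList <- read_html("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html") %>% 
                html_nodes(".news1") %>% html_attr("href")
myList

# Strip the "/article_t" prefix from each link
myList %>% gsub("/article_t", "", .)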
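
The question loops over pages 101 to 201, so the single-page approach can be wrapped in a function and applied to each page number. The sketch below is one possible way to do that; page_url and scrape_page are hypothetical helper names, and the URL pattern is taken from the question. Only the URL construction runs as written; the fetch is left commented out because it needs network access and the rvest package.

```r
# Hypothetical helper: build the URL for board page i (pattern from the question)
page_url <- function(i) {
  sprintf("http://www.mitbbs.com/bbsdoc1/USANews_%d_0.html", i)
}

# Hypothetical helper: return the cleaned topic links found on one page
scrape_page <- function(i) {
  page  <- rvest::read_html(page_url(i))
  links <- rvest::html_attr(rvest::html_nodes(page, ".news1"), "href")
  sub("/article_t", "", links, fixed = TRUE)
}

# sprintf is vectorised, so several page URLs can be built at once
urls <- page_url(101:103)
print(urls)

# Uncomment to fetch pages 101 to 201:
# all_links <- unlist(lapply(101:201, scrape_page))
```

Collecting the per-page results with lapply and flattening them with unlist avoids calling rbind on a list inside the loop, which is what the question's code does.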


Source: https://stackoverflow.com/questions/28814454/use-xpathsapply-in-r
