I know a bit of R, but I'm not a pro. I am working on a text-mining project using R.
I searched the Federal Reserve website with a keyword, say 'inflation'. The second page of the search results is https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation, and I want to extract the URLs of the documents listed there and save the text of each one.
Here you go. For the main search page, you can use a regular expression, as the URLs are easily identifiable in the source code.
(with the help of https://statistics.berkeley.edu/computing/r-reading-webpages)
library('RCurl')
library('stringr')
library('XML')
# The URL must be a single string; a line break inside the literal would break it
pageToRead <- readLines(paste0('https://search.newyorkfed.org/board_public/search?',
                               'start=10&Search=&number=10&text=inflation'))
# Result URLs appear in the page source after the marker "URL: ";
# capture the URL itself so gsub('\\1') below can strip the prefix
urlPattern <- 'URL: (\\S+)'
urlLines <- grep(urlPattern, pageToRead, value=TRUE)
# Extract the matched substring from each line (per the Berkeley tutorial)
getexpr <- function(s, g) substring(s, g, g + attr(g, 'match.length') - 1)
gg <- gregexpr(urlPattern, urlLines)
matches <- mapply(getexpr, urlLines, gg)
# Keep only the captured group, i.e. the URL without the 'URL: ' prefix
result <- gsub(urlPattern, '\\1', matches)
names(result) <- NULL
# seq_along() is safer than 1:length() when result is empty
for (i in seq_along(result)) {
  subURL <- result[i]
  if (str_sub(subURL, -4, -1) == ".htm") {
    content <- readLines(subURL)
    # Collapse the lines into a single string before parsing
    doc <- htmlParse(paste(content, collapse = '\n'), asText = TRUE)
    # Keep only the visible text nodes (skip scripts, styles, forms)
    docText <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    writeLines(docText, paste("inflationText_", i, ".txt", sep = ""))
  }
}
However, as you probably noticed, this parses only the .htm pages. For the .pdf documents that are linked in the search results, I would advise you to have a look here: http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
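For what it's worth, here is a minimal sketch along the lines of that tutorial, assuming the pdftools package is installed and reusing the result vector from above (the file-name prefix is just an example):

library('pdftools')

for (i in seq_along(result)) {
  subURL <- result[i]
  if (str_sub(subURL, -4, -1) == ".pdf") {
    # Download to a temp file first; binary mode matters on Windows
    tmp <- tempfile(fileext = ".pdf")
    download.file(subURL, tmp, mode = "wb")
    # pdf_text() returns one character string per page
    pdfText <- pdf_text(tmp)
    writeLines(pdfText, paste("inflationText_pdf_", i, ".txt", sep = ""))
  }
}

You could of course merge this into the main loop above as an else-if branch on the file extension.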