I know a bit of R, but I'm not a pro. I am working on a text-mining project using R.
I searched the Federal Reserve website with a keyword, say 'inflation'. The second page of the search results is https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation, and I want to extract the URLs of the documents listed there and save the text of each one.
Here you go. For the main search page, you can use a regular expression, as the URLs are easily identifiable in the source code.
(with the help of https://statistics.berkeley.edu/computing/r-reading-webpages)
library('RCurl')
library('stringr')
library('XML')
# The URL must be a single string; a line break inside the literal would break it
pageToRead <- readLines(paste0('https://search.newyorkfed.org/board_public/search?',
                               'start=10&Search=&number=10&text=inflation'))
# Result URLs appear in the page source after the marker "URL: ";
# capture the URL itself so gsub('\\1') below can strip the prefix
urlPattern <- 'URL: (\\S+)'
urlLines <- grep(urlPattern, pageToRead, value=TRUE)
# Extract the matched substring from each line (per the Berkeley tutorial)
getexpr <- function(s, g) substring(s, g, g + attr(g, 'match.length') - 1)
gg <- gregexpr(urlPattern, urlLines)
matches <- mapply(getexpr, urlLines, gg)
# Keep only the captured group, i.e. the URL without the 'URL: ' prefix
result <- gsub(urlPattern, '\\1', matches)
names(result) <- NULL
# seq_along() is safer than 1:length() when result is empty
for (i in seq_along(result)) {
  subURL <- result[i]
  if (str_sub(subURL, -4, -1) == ".htm") {
    content <- readLines(subURL)
    # Collapse the lines into a single string before parsing
    doc <- htmlParse(paste(content, collapse = '\n'), asText = TRUE)
    # Keep only the visible text nodes (skip scripts, styles, forms)
    docText <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    writeLines(docText, paste("inflationText_", i, ".txt", sep = ""))
  }
}
However, as you probably noticed, this parses only the .htm pages. For the .pdf documents that are linked in the search results, I would advise you to have a look here: http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
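For what it's worth, here is a minimal sketch along the lines of that tutorial, assuming the pdftools package is installed and reusing the result vector from above (the file-name prefix is just an example):

library('pdftools')

for (i in seq_along(result)) {
  subURL <- result[i]
  if (str_sub(subURL, -4, -1) == ".pdf") {
    # Download to a temp file first; binary mode matters on Windows
    tmp <- tempfile(fileext = ".pdf")
    download.file(subURL, tmp, mode = "wb")
    # pdf_text() returns one character string per page
    pdfText <- pdf_text(tmp)
    writeLines(pdfText, paste("inflationText_pdf_", i, ".txt", sep = ""))
  }
}

You could of course merge this into the main loop above as an else-if branch on the file extension.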