Extract text from search result URLs using R

Submitted by 流过昼夜 on 2019-11-28 12:26:44

Question


I know R a bit, but not a pro. I am working on a text-mining project using R.

I searched the Federal Reserve website with a keyword, say 'inflation'. The second page of the search results has the URL: https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation

This page has 10 search results (10 URLs). I want to write code in R that 'reads' the page behind each of those 10 URLs and extracts the text from each web page into a .txt file. My only input is the URL mentioned above.

I appreciate your help. If there is a similar older post, please point me to it as well. Thank you.


Answer 1:


This is a basic idea of how to go about scraping these pages, though it might be slow in R if there are many pages to scrape. Your question is a bit ambiguous: you want the end result to be .txt files, but what about the web pages that are PDFs? You can still use this code and change the file extension to .pdf for the pages that serve PDFs (see the sketch after the code block below for one way to handle the PDF links).

 library(xml2)
 library(rvest)

 # The search-results page to scrape
 urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

 # Collect the result links, drop duplicates, read the <body> of each page,
 # and write each one to a temporary .txt file
 urll %>%
   read_html() %>%
   html_nodes("div#results a") %>%
   html_attr("href") %>%
   .[!duplicated(.)] %>%
   lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
   Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"),
       ., paste("tmp", 1:length(.)))

Here is the breakdown of the code above. The URL you want to scrape from:

 urll="https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

Get all the URLs that you need:

  allurls <- urll%>%read_html()%>%html_nodes("div#results a")%>%html_attr("href")%>%.[!duplicated(.)]
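
A quick sanity check (not part of the original answer) to see what was extracted:

  # Illustrative: how many links were found, and what do the first few look like?
  length(allurls)
  head(allurls)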

Where do you want to save your text files? Create the temp files:

 tmps <- tempfile(c(paste("tmp",1:length(allurls))),fileext=".txt")

As of now, your allurls is of class character. You have to read each URL into an XML document in order to be able to scrape it, and then finally write the pages into the tmp files created above:

  allurls%>%lapply(function(x) read_html(x)%>%html_nodes("body"))%>%  
         Map(function(x,y) write_html(x,y,options="format"),.,tmps)

Please do not leave anything out; for example, after ..."format"), there is a period (the magrittr . placeholder), so take that into consideration. Your files have now been written to the temp directory. To find out where they are, just type tempdir() at the console and it will give you the location of your files. You can also change where the files are saved when scraping by adjusting the tmpdir argument of the tempfile() command.
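
For instance, a small sketch (assuming the allurls object from above; the fed_pages folder name is just an example) showing how to list the written files or redirect them to a folder of your own:

  # List the .txt files that were written to the temp directory
  list.files(tempdir(), pattern = "\\.txt$", full.names = TRUE)

  # Or create the temp files in a folder of your choosing via the tmpdir argument
  outdir <- file.path(getwd(), "fed_pages")   # hypothetical output folder
  dir.create(outdir, showWarnings = FALSE)
  tmps <- tempfile(paste("tmp", 1:length(allurls)), tmpdir = outdir, fileext = ".txt")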

Hope this helps.




Answer 2:


Here you go. For the main search page, you can use a regular expression, as the URLs are easily identifiable in the source code.

(with the help of https://statistics.berkeley.edu/computing/r-reading-webpages)

library('RCurl')
library('stringr')
library('XML')

pageToRead <- readLines('https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation')
urlPattern <- 'URL: <a href="(.+)">'
urlLines <- grep(urlPattern, pageToRead, value=TRUE)

getexpr <- function(s,g)substring(s, g, g + attr(g, 'match.length') - 1)
gg <- gregexpr(urlPattern, urlLines)
matches <- mapply(getexpr, urlLines, gg)
result = gsub(urlPattern,'\\1', matches)
names(result) = NULL


for (i in 1:length(result)) {
  subURL <- result[i]

  if (str_sub(subURL, -4, -1) == ".htm") {
    content <- readLines(subURL)
    doc <- htmlParse(content, asText=TRUE)
    doc <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    writeLines(doc, paste("inflationText_", i, ".txt", sep=""))

  }
}

However, as you probably noticed, this parses only the .htm pages. For the .pdf documents that are linked in the search results, I would advise you to have a look here: http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
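
Along those lines, one possible way to handle the .pdf links within the same loop is the pdftools package (a sketch that assumes the result vector and the stringr library from the code above; pdftools is not used in the original answer):

library(pdftools)   # pdf_text() extracts one character string per page

for (i in 1:length(result)) {
  subURL <- result[i]

  if (str_sub(subURL, -4, -1) == ".pdf") {
    # Download the PDF to a temporary file, then pull the text out of it
    tmp <- tempfile(fileext = ".pdf")
    download.file(subURL, tmp, mode = "wb")
    pages <- pdf_text(tmp)
    writeLines(pages, paste("inflationPdf_", i, ".txt", sep=""))
  }
}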



Source: https://stackoverflow.com/questions/45908989/extract-text-from-search-result-urls-using-r
