Extract text from search result URLs using R


礼貌的吻别 2020-12-22 09:58

I know a bit of R, but I am not a pro. I am working on a text-mining project in R.

I searched the Federal Reserve website for a keyword, say ‘inflation’. The second page of

2 answers

一生所求 (original poster)

2020-12-22 10:12
This is a basic idea of how to go about scraping these pages, though it might be slow in R if there are many pages to scrape. Your question is a bit ambiguous: you want the end results to be .txt files, but what about the web pages that are PDFs? You can still use this code and change the file extension to .pdf for those pages.
```
library(xml2)
library(rvest)

# Second page of search results for the keyword "inflation"
urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

# Collect the unique result links, read each page, and write its <body> to a temp .txt file
urll %>% read_html() %>% html_nodes("div#results a") %>% html_attr("href") %>%
  .[!duplicated(.)] %>% lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
  Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"), .,
      c(paste("tmp", 1:length(.))))
```
Here is a breakdown of the code above. The URL you want to scrape from:
```
 urll="https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"
```
Get all the URLs that you need:
```
  allurls <- urll%>%read_html()%>%html_nodes("div#results a")%>%html_attr("href")%>%.[!duplicated(.)]
```
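To see what this step does without touching the network, here is a minimal sketch on a made-up HTML fragment. The markup and the example.com links are assumptions for illustration only, not the real search page:

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the search results page
page <- read_html('<html><body><div id="results">
  <a href="https://example.com/a.htm">A</a>
  <a href="https://example.com/b.htm">B</a>
  <a href="https://example.com/a.htm">A again</a>
</div></body></html>')

# Same selector and de-duplication as in the answer
links <- page %>% html_nodes("div#results a") %>% html_attr("href")
unique_links <- links[!duplicated(links)]
print(unique_links)
```

The repeated link is dropped, so each result page is only fetched once in the later steps.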
Where do you want to save your texts? Create the temp files:
```
 tmps <- tempfile(c(paste("tmp",1:length(allurls))),fileext=".txt")
```
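This works because tempfile() accepts a vector of name prefixes and returns one unique path per prefix. A small sketch, with a hypothetical vector of three prefixes in place of the real URL count:

```r
# tempfile() is vectorised over `pattern`: one unique path per prefix
prefixes <- paste0("tmp", 1:3)
tmps_demo <- tempfile(prefixes, fileext = ".txt")

print(basename(tmps_demo))          # the random suffix differs on every call
print(unique(dirname(tmps_demo)))   # by default, the session temp directory
```

Note that tempfile() only generates the names; the files themselves do not exist until something is written to them.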
At this point, allurls is a character vector. You have to parse each URL into an XML document in order to scrape it, then finally write the results into the temp files created above:
```
  allurls%>%lapply(function(x) read_html(x)%>%html_nodes("body"))%>%  
         Map(function(x,y) write_html(x,y,options="format"),.,tmps)
```
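Keep in mind that write_html() saves the HTML markup of each body, not just its text. If plain text is the goal, one possible variation (an assumption about what is wanted, not part of the code above) is to use html_text() and writeLines(). The sketch below uses an inline HTML string in place of a downloaded page:

```r
library(xml2)
library(rvest)

# Hypothetical page content standing in for one downloaded result page
doc <- read_html("<html><body><h1>Inflation</h1><p>Prices rose in 2020.</p></body></html>")

# Extract the text of <body> and write it to a temp .txt file
body_text <- doc %>% html_nodes("body") %>% html_text()
out_file <- tempfile("tmp", fileext = ".txt")
writeLines(body_text, out_file)

print(readLines(out_file))
```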
Please do not leave anything out. For example, after ..."format"), there is a period (the dot placeholder that passes the piped list on to Map); take that into consideration. Your files have now been written to the temporary directory. To find out where they are, type tempdir() in the console and it will give you the location of your files. You can also change where the files are saved through the tempfile command.
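As a rough sketch of both points, the snippet below writes two placeholder files into a subfolder and then lists them. The folder name scraped_pages and the placeholder text are made up for illustration; tmpdir is the tempfile() argument that controls where the paths are created:

```r
# Hypothetical output folder inside the session temp directory
out_dir <- file.path(tempdir(), "scraped_pages")
dir.create(out_dir, showWarnings = FALSE)

# tmpdir= controls where tempfile() places the generated paths
paths <- tempfile(paste0("tmp", 1:2), tmpdir = out_dir, fileext = ".txt")
for (p in paths) writeLines("placeholder text", p)

# List what was written
print(list.files(out_dir, pattern = "\\.txt$"))
```

Pointing out_dir at a folder outside tempdir() keeps the files after the R session ends, since the session temp directory is normally cleaned up on exit.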

Hope this helps.