R memory issues while webscraping with rvest


Question


I am using rvest to scrape web pages in R, and I'm running into memory issues. I have a 28,625-by-2 data frame of strings called urls that contains the links to the pages I'm scraping; each row holds two related links. I want to generate a 28,625-by-4 data frame Final with information scraped from the links. One piece of information comes from the second link in a row, and the other three come from the first link. The XPaths to the three pieces of information are stored as strings in the vector xpaths. I am doing this with the following code:

library(rvest)  # provides html(), html_node(), html_table(), and the %>% pipe

# preallocate a character vector for all 4 * 28,625 values
data <- rep("", 4 * 28625)
k <- 1

for (i in 1:28625) {

  # one value comes from a table on the second link of row i
  name <- html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = TRUE)

  data[k] <- name[4, 3]

  # the other three values come from the first link of row i
  data[k + 1:3] <- html(urls[i, 1]) %>%
    html_nodes(xpath = xpaths) %>%
    html_text()

  k <- k + 4

}

# reshape the flat vector into the 28,625-row, 4-column result
dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))

It works well enough, but when I open the task manager, I see that memory usage has been increasing monotonically and is at 97% after about 340 iterations. I'd like to just start the program and come back in a day or two, but all of my RAM will be exhausted before the job is done. I've done a bit of research into how R allocates memory, and I've done my best to preallocate and modify in place so the code doesn't make unnecessary copies of things.
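To make concrete what I mean by preallocating, here is a toy illustration (nothing to do with rvest itself):

# growing a vector reallocates and copies it on every iteration
x <- character(0)
for (i in 1:10) x <- c(x, "a")

# preallocating and modifying in place avoids those copies;
# this is the pattern my loop above uses
x <- character(10)
for (i in 1:10) x[i] <- "a"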

Why is this so memory intensive? Is there anything I can do to resolve it?


Answer 1:


rvest has been updated to resolve this issue: version 0.3.0 replaced the XML package with xml2, which fixes this memory leak, and deprecated html() in favor of read_html(). See the release announcement:

http://www.r-bloggers.com/rvest-0-3-0/
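In case it's useful, here is a minimal sketch of the question's loop after upgrading to rvest >= 0.3.0, reusing the asker's urls and xpaths objects. The substantive change is swapping html() for read_html(); I've also joined the three XPaths with the XPath union operator, since xml2 expects a single XPath string, and extracted the table cell with [[ ]] so it works whether html_table() returns a data frame or a tibble:

library(rvest)  # requires rvest >= 0.3.0, which parses pages with xml2

data <- rep("", 4 * 28625)
k <- 1

for (i in 1:28625) {

  # read_html() replaces the deprecated html(); xml2 frees parsed
  # documents once they are garbage collected, so memory stays bounded
  tbl <- read_html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = TRUE)

  data[k] <- tbl[[3]][4]  # column 3, row 4 of the scraped table

  # the XPath union ("a | b | c") returns matches in document order,
  # which may differ from the order of the xpaths vector
  data[k + 1:3] <- read_html(urls[i, 1]) %>%
    html_nodes(xpath = paste(xpaths, collapse = " | ")) %>%
    html_text()

  k <- k + 4

}

dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))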



Source: https://stackoverflow.com/questions/31999766/r-memory-issues-while-webscraping-with-rvest
