Scraping multiple URLs by looping in PhantomJS

一个人的身影 2020-12-11 10:57

I am using PhantomJS to scrape some websites and then extract information with R. I am following this tutorial. Everything works fine for a single page, but I couldn't

2 answers

無奈伤痛 (original poster)

2020-12-11 11:15

Given my very, very limited knowledge of JS, I thought about a workaround to the problem. I am still interested in solving it properly, but I foresee that will take quite some time.

For the moment I got what I wanted by experimenting in R. Instead of running the loop within JS, I used R to write one standalone JS script per page, so that the "PhantomJS is asynchronous" problem (page.open returns before the page has finished loading, so a plain loop moves on too early) is bypassed.

The trick consists of exporting the chunk of JS code using write.table with the parameter quote=F, and using .js as the file extension, so that it is correctly recognized as a JS file. I guess this workaround has limited applicability to other similar tasks, but it might nonetheless help someone. Comments are much appreciated.

countries <- c("Afghanistan", "Albania", "Algeria")

# Write one standalone PhantomJS script per country
for (i in unique(countries)) {

  # The two lines that differ between scripts: output file and target URL
  outputline <- paste0("var path = '", i, ".html';")
  inputline <- paste0("page.open('http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=", i, "', function (status) {")

  # One data frame row per line of the generated script
  df <- data.frame(lines = character(11), stringsAsFactors = FALSE)
  df$lines[1] <- "var webPage = require('webpage');"
  df$lines[2] <- "var page = webPage.create();"
  df$lines[3] <- "var fs = require('fs');"
  df$lines[4] <- ""
  df$lines[5] <- outputline
  df$lines[6] <- ""
  df$lines[7] <- inputline
  df$lines[8] <- "  var content = page.content;"
  df$lines[9] <- "  fs.write(path, content, 'w');"
  df$lines[10] <- "  phantom.exit();"
  df$lines[11] <- "});"

  # quote = FALSE keeps the JS source free of surrounding quotation marks
  write.table(df, paste0(i, ".js"), sep = " ", quote = FALSE, row.names = FALSE, col.names = FALSE)

}

library(rvest)
library(stringr)
library(plyr)
library(dplyr)
library(ggvis)
library(knitr)
options(digits = 4)


# Run every generated script with PhantomJS, one after another
for (i in countries) {
  system(paste0("./phantomjs ", i, ".js"))
}
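
As for solving the problem properly inside PhantomJS, one common pattern is to start each page.open from inside the previous call's callback rather than from a plain for loop, so the next page is requested only after the current one has been saved. What follows is a minimal sketch under assumptions, not a tested drop-in replacement: forEachSequentially is a hypothetical helper name, the country list and URL are copied from the R code above, and encodeURIComponent is added only as a precaution for country names that contain spaces.

```javascript
// Generic sequential runner (hypothetical helper, not part of PhantomJS).
// `worker(item, next)` must call `next()` when it has finished with `item`;
// `done()` fires once after the last item.
function forEachSequentially(items, worker, done) {
  var index = 0;
  function next() {
    if (index >= items.length) {
      done();
      return;
    }
    var item = items[index];
    index += 1;
    worker(item, next);
  }
  next();
}

// PhantomJS wiring; skipped when the file is loaded outside PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var fs = require('fs');
  var countries = ['Afghanistan', 'Albania', 'Algeria'];
  var baseUrl = 'http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=';

  forEachSequentially(countries, function (country, next) {
    // Assumption: encoding the name is harmless for this site.
    page.open(baseUrl + encodeURIComponent(country), function (status) {
      if (status === 'success') {
        fs.write(country + '.html', page.content, 'w');
      } else {
        console.log('Failed to load page for ' + country);
      }
      next(); // only now move on to the next country
    });
  }, function () {
    phantom.exit(); // exit once, after the last page
  });
}
```

Saved under a name such as scrape_all.js (a hypothetical filename), this would be run once with ./phantomjs scrape_all.js and should leave one .html file per country, the same output the per-country scripts produce.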
