I am using PhantomJS to scrape some websites and then extract information with R. I am following this tutorial. Everything works fine for a single page, but I couldn't
Given my very limited knowledge of JavaScript, I came up with a workaround to the problem. I am still interested in solving the problem properly, but I expect that will take quite some time.
For the moment I got what I wanted with some experimentation in R. Instead of running the loop within JavaScript, I used R to write one small JavaScript file per page, which sidesteps the problem that PhantomJS is asynchronous.
The trick consists in exporting each chunk of JavaScript code using write.table with the parameter quote=F, so that the lines are not wrapped in double quotes, and using .js as the file extension, so that the file is correctly recognized as a JavaScript file. I guess this workaround has limited applicability to other similar tasks, but it might nonetheless help someone. Comments are much appreciated.
countries <- c("Afghanistan", "Albania", "Algeria")

# write one PhantomJS script per country
for (i in unique(countries)) {
  # one row per line of JavaScript
  df <- data.frame(lines = character(11), stringsAsFactors = FALSE)
  # the two lines that depend on the country: the output file and the URL to open
  outputline <- paste("var path = '", i, ".html';", sep = "")
  inputline <- paste("page.open('http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=", i, "', function (status) {", sep = "")
  df$lines[1] <- "var webPage = require('webpage');"
  df$lines[2] <- "var page = webPage.create();"
  df$lines[3] <- "var fs = require('fs');"
  df$lines[4] <- ""
  df$lines[5] <- outputline
  df$lines[6] <- ""
  df$lines[7] <- inputline
  df$lines[8] <- "  var content = page.content;"
  df$lines[9] <- "  fs.write(path, content, 'w');"
  df$lines[10] <- "  phantom.exit();"
  df$lines[11] <- "});"
  # quote = F keeps R from wrapping each line in double quotes
  write.table(df, paste(i, ".js", sep = ""), sep = " ", quote = F, row.names = F, col.names = F)
}
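The same file-writing step can probably be done without a data frame. What follows is only a minimal sketch using base R's writeLines; the helper name build_phantom_script, the base_url argument and the temporary output directory are my own assumptions for illustration, not part of the original workaround. It also percent-encodes the country name, which should matter for names that contain spaces.

```r
# Hypothetical helper: return the PhantomJS script for one country as a
# character vector, one element per line of JavaScript.
build_phantom_script <- function(country,
                                 base_url = "http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=") {
  # Percent-encode the country so names with spaces stay valid in the URL
  url <- paste0(base_url, utils::URLencode(country, reserved = TRUE))
  c(
    "var webPage = require('webpage');",
    "var page = webPage.create();",
    "var fs = require('fs');",
    "",
    sprintf("var path = '%s.html';", country),
    "",
    sprintf("page.open('%s', function (status) {", url),
    "  var content = page.content;",
    "  fs.write(path, content, 'w');",
    "  phantom.exit();",
    "});"
  )
}

countries <- c("Afghanistan", "Albania", "Algeria")
out_dir <- tempdir()  # assumption: a temporary directory, just for the demo

for (country in unique(countries)) {
  # writeLines writes each element as its own line and never adds quotes
  writeLines(build_phantom_script(country),
             file.path(out_dir, paste0(country, ".js")))
}
```

Keeping the script text in a function makes it easy to inspect a single script in the console before writing anything to disk.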
library(rvest)
library(stringr)
library(plyr)
library(dplyr)
library(ggvis)
library(knitr)
options(digits = 4)
# run all individual JavaScript files, one PhantomJS call per country
for (i in countries) {
  system(paste0("./phantomjs ", i, ".js"))
}
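If some of the runs fail, it may help to keep the exit status of each call. This is a rough sketch only, assuming the phantomjs binary sits in the working directory as above; run_phantom_scripts is a hypothetical helper name. It uses base R's system2, which returns the exit status of the command when its output is not captured.

```r
# Hypothetical helper: run one PhantomJS script per country and return a
# named integer vector of exit statuses (NA when the script file is missing).
run_phantom_scripts <- function(countries, phantomjs = "./phantomjs") {
  vapply(countries, function(country) {
    script <- paste0(country, ".js")
    # Skip countries whose script was never written
    if (!file.exists(script)) return(NA_integer_)
    # shQuote protects file names that contain spaces
    as.integer(system2(phantomjs, args = shQuote(script)))
  }, integer(1))
}

# Usage, assuming the scripts are in the working directory:
# statuses <- run_phantom_scripts(c("Afghanistan", "Albania", "Algeria"))
# names(statuses)[is.na(statuses) | statuses != 0]  # countries to retry
```

A status of 0 normally means the script exited cleanly; it does not by itself guarantee that the page loaded, since the generated scripts do not check the status argument of page.open.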