Scraping multiple URLs by looping in PhantomJS

一个人的身影 2020-12-11 10:57

I am using PhantomJS to scrape some websites and then extract information with R. I am following this tutorial. Everything works fine for a single page, but I couldn't

2 answers
  •  無奈伤痛
    2020-12-11 11:15

    Given my very limited knowledge of JavaScript, I came up with a workaround. I am still interested in solving the problem properly, but I expect that will take quite some time.

    For the moment I got what I wanted by experimenting in R. Instead of running the loop within JavaScript, I used R to write one standalone js script per page, so that the problem of PhantomJS being asynchronous is bypassed.

    The trick consists in exporting the chunk of js code using write.table with the parameter quote=FALSE, so that the lines are not wrapped in quotation marks, and using .js as the file extension, so that the output is correctly recognized as a js file. I guess this workaround has limited applicability to other similar tasks, but it might nonetheless help someone. Comments are much appreciated.

    countries <- c("Afghanistan", "Albania", "Algeria")

    # Write one standalone PhantomJS script per country
    for (i in unique(countries)) {

      df <- data.frame(lines = character(11),
                       stringsAsFactors = FALSE)
      # Only the output file name and the page URL vary by country
      outputline <- paste0("var path = '", i, ".html';")
      inputline <- paste0("page.open('http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=", i, "', function (status) {")
      df$lines[1] <- "var webPage = require('webpage');"
      df$lines[2] <- "var page = webPage.create();"
      df$lines[3] <- "var fs = require('fs');"
      df$lines[4] <- ""
      df$lines[5] <- outputline
      df$lines[6] <- ""
      df$lines[7] <- inputline
      df$lines[8] <- "  var content = page.content;"
      df$lines[9] <- "  fs.write(path, content, 'w');"
      df$lines[10] <- "  phantom.exit();"
      df$lines[11] <- "});"

      # quote = FALSE keeps the js lines free of surrounding quotation marks
      write.table(df, paste0(i, ".js"), sep = " ", quote = FALSE, row.names = FALSE, col.names = FALSE)

    }
    
    library(rvest)
    library(stringr)
    library(plyr)
    library(dplyr)
    library(ggvis)
    library(knitr)
    options(digits = 4)
    
    
    # Run all the individual JavaScript files, one PhantomJS process per country
    for (i in countries) {
      # "./phantomjs" means the binary is expected in the working directory
      system(paste0("./phantomjs ", i, ".js"))
    }
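
    For comparison, looping inside a single PhantomJS script would probably mean chaining the page.open callbacks, so that each page is opened only after the previous one has finished. The following is a minimal sketch of that idea, not a tested solution for this site. The names processSequentially, buildUrl and BASE_URL are hypothetical helpers introduced here, and the PhantomJS wiring at the bottom assumes the same URL pattern and output file names as the R code above.

```javascript
// Hypothetical base URL, copied from the R code above.
var BASE_URL = 'http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=';

// Build the page URL for one country.
function buildUrl(country) {
  return BASE_URL + country;
}

// Visit items strictly one at a time: the next visit starts only
// after the previous visit has called its callback.
// `visit(item, callback)` is supplied by the caller.
function processSequentially(items, visit, done) {
  var results = [];
  function next(index) {
    if (index >= items.length) {
      done(results);
      return;
    }
    visit(items[index], function (status) {
      results.push({ item: items[index], status: status });
      next(index + 1);
    });
  }
  next(0);
}

// PhantomJS-only wiring, skipped when the `phantom` global is absent.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var fs = require('fs');
  var countries = ['Afghanistan', 'Albania', 'Algeria'];

  processSequentially(countries, function (country, callback) {
    page.open(buildUrl(country), function (status) {
      // Save the page only if it loaded; move on either way.
      if (status === 'success') {
        fs.write(country + '.html', page.content, 'w');
      }
      callback(status);
    });
  }, function () {
    // Exit once, after the last page has been handled.
    phantom.exit();
  });
}
```

    With this layout phantom.exit() is called only once, after the last callback, rather than from inside the first page.open callback, which I suspect is what cuts a plain for loop short.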
    
