Question
Why is the page source of youtube.com not scrapeable?
I tried the following (with PhantomJS as well as with Chrome through a Selenium server):
library(RSelenium)
pJS <- phantom(pjs_cmd = ...)
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("https://www.youtube.com/")
remDr$getTitle()[[1]] # [1] "YouTube"
remDr$getPageSource()
Returns:
Error in fromJSON(content, handler, default.size, depth, allowComments, :
invalid JSON input
Answer 1:
It's an encoding issue. Use the development version for now, until the next version is released to CRAN:
devtools::install_github("ropensci/RSelenium")
Answer 2:
I agree that the problem is most probably related to encoding.
For instance, on the nasa.gov website the problem seems to appear only on topic pages related to American-Russian space collaboration, which suggests that it is caused by Cyrillic characters in the page content.
I solved the problem by using the deprecated Relenium package where RSelenium fails. To make Relenium run smoothly on Ubuntu 16.04, I had to install Firefox 25.0 and configure it to prevent any updates. The other issue during setup was installing rJava properly, which can fail when the environment variables holding the paths to the Java libraries are missing.
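As a rough sketch of the rJava point, the Java location can be exposed to R before installing the package. The path below is an assumption (a common default on Debian/Ubuntu systems, not taken from the setup described here) and should be adjusted to wherever the JDK actually lives:

```r
# Assumed JDK location; this path is hypothetical and must match your system.
java_home <- "/usr/lib/jvm/default-java"

# Only set JAVA_HOME if the directory really exists on this machine.
if (dir.exists(java_home)) {
  Sys.setenv(JAVA_HOME = java_home)
}

# Show what R will see; an empty string means JAVA_HOME is still unset.
Sys.getenv("JAVA_HOME")

# After that, rJava can be installed as usual:
# install.packages("rJava")
```

On Linux, running `sudo R CMD javareconf` in a shell is another common way to let R detect the Java paths.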
System configuration is as follows:
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
relenium_0.3.0; seleniumJars_2.41.0; rJava_0.9-8; RSelenium_1.3.5
Below is an example of a page that can be scraped with Relenium but not with the release version of RSelenium:
link = "http://www.nasa.gov/mission_pages/station/expeditions/expedition14/index.html"
The RSelenium solution fails (with either Firefox 34.0.5 or 25.0, it makes no difference):
library(RSelenium)
startServer() # start the standalone Selenium server
remDr <- remoteDriver()
remDr$open()
remDr$navigate(link)
doc = unlist(remDr$getPageSource()) # this call raises the error
Result: "Error in fromJSON(content, handler, default.size, depth, allowComments, : invalid JSON input"
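Relenium, by contrast, handles the same page:
library(relenium)
library(xml2) # provides read_html()
relenium_browser <- firefoxClass$new()
relenium_browser$get(link)
doc = unlist(relenium_browser$getPageSource())
doc = read_html(doc) # parse the raw source into an HTML document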
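When looping over many pages, it may help to catch the "invalid JSON input" error so that one problematic page does not stop the whole run. The following is only a sketch: `safe_page_source` is a hypothetical helper, not part of RSelenium or Relenium, and the two stand-in "drivers" are plain lists used to show both outcomes without a browser.

```r
# Hypothetical helper: returns the page source as a string,
# or NA if the driver call raises an error.
safe_page_source <- function(driver) {
  tryCatch(
    unlist(driver$getPageSource())[1],
    error = function(e) {
      message("getPageSource() failed: ", conditionMessage(e))
      NA_character_
    }
  )
}

# Stand-in drivers (plain lists) that mimic a success and a failure
ok_driver  <- list(getPageSource = function() list("<html></html>"))
bad_driver <- list(getPageSource = function() stop("invalid JSON input"))

safe_page_source(ok_driver)   # returns the source string
safe_page_source(bad_driver)  # prints a message and returns NA
```

With a real session, `safe_page_source(remDr)` could be called in the same way, and pages that come back as NA could then be retried with another tool such as Relenium.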
Source: https://stackoverflow.com/questions/29994843/rselenium-get-youtube-page-source