rselenium | get youtube page source

Submitted by 独自空忆成欢 on 2019-12-11 04:26:50

Question


Why is the page source of youtube.com not scrapeable?

I tried the following, using PhantomJS as well as Chrome with a Selenium server:

library(RSelenium)
pJS <- phantom(pjs_cmd = ...)
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("https://www.youtube.com/")
remDr$getTitle()[[1]] # [1] "YouTube"
remDr$getPageSource()

Returns:

Error in fromJSON(content, handler, default.size, depth, allowComments,  : 
  invalid JSON input

Answer 1:


It's an encoding issue. Use the development version for now, until the next version is released to CRAN:

devtools::install_github("ropensci/RSelenium")



Answer 2:


I would agree that the problem is most probably with encoding.

For instance, on the nasa.gov website the same problem seems to appear only on topic pages related to American-Russian space collaboration, which suggests that it is caused by Cyrillic characters in the page content.

I solved the problem by falling back to the deprecated Relenium package where RSelenium fails. To make Relenium run smoothly on Ubuntu 16.04, I had to install Firefox 25.0 and configure it to prevent any updates. The other issue during setup was installing rJava properly, which can fail when the environment variables pointing to the Java libraries are missing.
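To illustrate the kind of failure an encoding mismatch can cause, here is a minimal base-R sketch. It does not reproduce RSelenium's internals; the byte sequence and variable names are made up for illustration. It only shows that a string containing bytes that are not valid UTF-8 can be detected with validUTF8() and repaired by declaring the encoding the bytes were really written in (assumed here to be Latin-1).

```r
# Hypothetical payload: "A", a stray 0xFF byte (never valid in UTF-8), "B".
bad <- rawToChar(as.raw(c(0x41, 0xff, 0x42)))

# A Cyrillic word written with Unicode escapes, so it is valid UTF-8.
good <- "\u0421\u043e\u044e\u0437"

# validUTF8() checks the raw bytes of each element.
is_valid <- validUTF8(c(bad, good))

# Assumption: the stray byte was really Latin-1. Re-declaring the source
# encoding turns 0xFF into a proper two-byte UTF-8 character.
fixed <- iconv(bad, from = "latin1", to = "UTF-8")
```

A parser that expects UTF-8 can reject input like `bad` outright, which is consistent with an "invalid JSON input" error appearing only on pages with unusual characters.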

System configuration is as follows:

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

relenium_0.3.0; seleniumJars_2.41.0; rJava_0.9-8; RSelenium_1.3.5 

Below is an example of a page that can be scraped with Relenium but not with the release version of RSelenium:

link = "http://www.nasa.gov/mission_pages/station/expeditions/expedition14/index.html"

The RSelenium solution fails (with Firefox 34.0.5 or 25.0, it makes no difference):

library(RSelenium)
startServer() # start the bundled Selenium server (RSelenium 1.3.5)
remDr <- remoteDriver()
remDr$open()
remDr$navigate(link)
doc = unlist(remDr$getPageSource()) # fails here

Result: "Error in fromJSON(content, handler, default.size, depth, allowComments, : invalid JSON input"

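Relenium, by contrast, handles it fine:

library(relenium)
library(xml2) # provides read_html()
relenium_browser <- firefoxClass$new()
relenium_browser$get(link)
doc = unlist(relenium_browser$getPageSource())
doc = read_html(doc) # parse the page source into an HTML document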


Source: https://stackoverflow.com/questions/29994843/rselenium-get-youtube-page-source
