Scraping a JavaScript website in R

慢半拍i 2020-12-04 22:58

I want to scrape the match time and date from this URL:

http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary

By using the chrome dev t

2 answers
  • 2020-12-04 23:48

    You could also run the Selenium server in a Docker container (instead of installing Selenium locally).

    You will still need PhantomJS installed, as well as Docker. Then run:

    library(RSelenium)
    library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
    
    url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"
    
    system('docker run -d -p 4445:4444 selenium/standalone-chrome') 
    remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "chrome")
    remDr$open()
    remDr$navigate(url)
    
    writeLines(sprintf("var page = require('webpage').create();
    page.open('%s', function () {
        console.log(page.content); //page source
        phantom.exit();
    });", url), con="scrape.js")
    
    # run PhantomJS and save the rendered page source to scrape.html
    system("phantomjs scrape.js > scrape.html", intern = TRUE)
    
    # extract the content you need
    pg <- read_html("scrape.html")
    pg %>% html_nodes("#utime") %>% html_text()
    
    # [1] "10:20 AM, October 28, 2014"
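
    Since the Selenium session opened above already has the page loaded, the rendered HTML could presumably also be read straight from the browser, without the PhantomJS step. A minimal sketch, assuming an open `remDr` session as in the code above; `scrape_rendered_text` is a hypothetical helper name, not part of RSelenium or rvest:

    ```r
    library(rvest)

    # Hypothetical helper (not from the answer above): navigates an already-open
    # RSelenium remoteDriver to `url` and returns the text of all nodes
    # matching the CSS selector `css`.
    scrape_rendered_text <- function(remDr, url, css) {
      remDr$navigate(url)
      # getPageSource() returns a list; its first element is the HTML string
      html_src <- remDr$getPageSource()[[1]]
      read_html(html_src) %>% html_nodes(css) %>% html_text()
    }
    ```

    It would be called as `scrape_rendered_text(remDr, url, "#utime")`. If the page fills in the element with a delay, a short `Sys.sleep()` after `navigate()` may be needed; that is an assumption about the site, not something tested here.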
    
  • 2020-12-05 00:00

    RSelenium is no longer the only option. If you can install the PhantomJS binary (available from http://phantomjs.org/), you can use it to render the HTML and then scrape the result with rvest. This is similar to the RSelenium approach but does not require Java:

    library(rvest)
    
    # render HTML from the site with phantomjs
    
    url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"
    
    writeLines(sprintf("var page = require('webpage').create();
    page.open('%s', function () {
        console.log(page.content); //page source
        phantom.exit();
    });", url), con="scrape.js")
    
    # run PhantomJS and save the rendered page source to scrape.html
    system("phantomjs scrape.js > scrape.html", intern = TRUE)
    
    # extract the content you need
    pg <- read_html("scrape.html")  # html() is deprecated in current rvest
    pg %>% html_nodes("#utime") %>% html_text()
    
    ## [1] "10:20 AM, October 28, 2014"
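
    The scraped value is a single string, so getting a usable date and time out of it takes one more step. A minimal sketch using base R only; the `"UTC"` time zone is an assumption (the page does not say which zone the time is in), and the time locale is switched to `"C"` temporarily so that the English month name and AM/PM marker parse regardless of system language:

    ```r
    raw <- "10:20 AM, October 28, 2014"

    # Parse with English month/AM-PM names, then restore the previous locale
    old_locale <- Sys.getlocale("LC_TIME")
    Sys.setlocale("LC_TIME", "C")
    # tz = "UTC" is an assumption; replace with the site's actual time zone
    match_time <- as.POSIXct(raw, format = "%I:%M %p, %B %d, %Y", tz = "UTC")
    Sys.setlocale("LC_TIME", old_locale)

    format(match_time, "%Y-%m-%d %H:%M")
    ## [1] "2014-10-28 10:20"
    ```

    From `match_time` the date alone is available with `as.Date(match_time)`.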
    