RSelenium: Scraping a dynamically loaded page that loads slowly

Submitted by 蹲街弑〆低调 on 2021-02-18 18:39:50

Question


I'm not sure whether it's because my internet is slow, but I'm trying to scrape a website that loads information as you scroll down the page. I'm executing a script that scrolls to the end of the page and waits for the Selenium/Chrome server to load the additional content. The page does load new content, because I'm able to scrape information that wasn't on the page originally and the new content shows up in the Chrome window, but it only updates once. I added a Sys.sleep() call that waits a minute on each iteration so the content has plenty of time to load, but the page still doesn't update more than once. Am I using RSelenium incorrectly? Are there other ways to scrape a site that loads dynamically?

Anyway, any kind of advice or help you can provide would be awesome.

Below is what I think is the relevant portion of my code, the part that loads new content at the end of the page:

for(i in 1:3){
  webElem <- remDr$findElement('css', 'body')
  remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);') 
  Sys.sleep(60)
}

Below is the full code:

library(RSelenium)
library(rvest)
library(stringr)

rsDriver(port = 4444L, browser = 'chrome')
remDr <- remoteDriver(browser = 'chrome')
remDr$open()
remDr$navigate('http://www.codewars.com/kata')

#find the total number of recorded katas
tot_kata <- remDr$findElement(using = 'css', '.is-gray-text')$getElementText() %>%
  unlist() %>%
  str_extract('\\d+') %>%
  as.numeric()

#there are about 30 katas per page reload
tot_pages <- (tot_kata/30) %>%
  ceiling()

#will be 1:tot_pages once I know the below code works
for(i in 1:3){
  webElem <- remDr$findElement('css', 'body')
  remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);') 
  Sys.sleep(60)
}

page_source <- remDr$getPageSource()

kata_vector <- read_html(page_source[[1]]) %>%
  html_nodes('.item-title a') %>%
  html_attr('href') %>%
  str_replace('/kata/', '')

remDr$close()

Answer 1:


The website provides an API, which should be the first port of call. Failing that, you can access individual pages directly, for example:

http://www.codewars.com/kata?page=21
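
If those paginated URLs render the kata list server-side, you can skip Selenium entirely and loop over pages with rvest. Below is a minimal sketch under that assumption; the CSS selectors are the ones from the question:

library(rvest)
library(stringr)

kata_vector <- character(0)
for (page in 1:3) {           # extend the range to cover all pages
  url <- paste0('http://www.codewars.com/kata?page=', page)
  katas <- read_html(url) %>%
    html_nodes('.item-title a') %>%
    html_attr('href') %>%
    str_replace('/kata/', '')
  kata_vector <- c(kata_vector, katas)
  Sys.sleep(1)                # be polite to the server
}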

If you want to scroll to the bottom of the page until there is no more content with RSelenium, you can use the "Loading..." element, which has class js-infinite-marker. While this element is still present on the page, we attempt to scroll down to it every second (with some error catching for any issues). Once the element is no longer present, we assume all the content has loaded:

library(RSelenium)

rD <- rsDriver(port = 4444L, browser = 'chrome')
remDr <- rD$client # the client is already open; no need to call the open() method
remDr$navigate('http://www.codewars.com/kata')
chk <- FALSE
while(!chk){
  webElem <- remDr$findElements("css", ".js-infinite-marker")
  if(length(webElem) > 0L){
    tryCatch(
      remDr$executeScript("elem = arguments[0];
                           elem.scrollIntoView();
                           return true;", list(webElem[[1]])),
      error = function(e){}
    )
    Sys.sleep(1L)
  }else{
    chk <- TRUE
  }
}
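
Once the loop exits, all the content should be loaded, so you can pull the page source and parse it exactly as in the question, then shut everything down:

library(rvest)
library(stringr)

page_source <- remDr$getPageSource()

kata_vector <- read_html(page_source[[1]]) %>%
  html_nodes('.item-title a') %>%
  html_attr('href') %>%
  str_replace('/kata/', '')

# close the client and stop the server started by rsDriver()
remDr$close()
rD$server$stop()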


Source: https://stackoverflow.com/questions/42595268/rselenium-scraping-a-dynamically-loaded-page-that-loads-slowly
