Scraping a webpage with React JS in R

Submitted by 别来无恙 on 2019-11-30 18:34:04

Question



I'm trying to scrape the page below: https://metro.zakaz.ua/uk/?promotion=1
The page is rendered with React.
I can scrape the first page with this code:

library(rvest)
library(jsonlite)

url <- "https://metro.zakaz.ua/uk/?promotion=1"

# Parse the JSON embedded in the page's 8th <script> tag and pull out the catalog items
read_html(url) %>%
  html_nodes("script") %>%
  .[[8]] %>%
  html_text() %>%
  fromJSON() %>%
  .$catalog %>% .$items %>%
  data.frame

That gives me all items from the first page, but I don't know how to scrape the other pages.
This JS code moves to the next page, if that helps:

document.querySelectorAll('.catalog-pagination')[0].children[1].children[0].click()
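For reference, if you drive a real browser with RSelenium (as the answers below suggest), the same click could presumably be issued from R through executeScript. This is only a sketch; remDr is assumed to be an open remoteDriver session already pointed at the page:

# Sketch: run the pagination click as JavaScript from an existing
# RSelenium session (remDr assumed to be a connected remoteDriver).
remDr$executeScript(
  "document.querySelectorAll('.catalog-pagination')[0].children[1].children[0].click();",
  args = list()
)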

Thanks for any help!


Answer 1:


You will need RSelenium to perform headless navigation.

For setup, see: How to set up rselenium for R?
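If you want the session to be truly headless, one option is to pass Chrome's --headless flag via extraCapabilities. This is a sketch only, not part of the original answer, and the capability key may need to be "goog:chromeOptions" with newer chromedriver releases; the code below simply opens a visible Chrome window, which works the same way.

library(RSelenium)

# Sketch: a headless Chrome session via extraCapabilities (port and variable
# name chosen here to avoid clashing with the answer's rD on port 4444).
eCaps <- list(chromeOptions = list(args = list("--headless", "--disable-gpu")))
rD_headless <- rsDriver(port = 4445L, browser = "chrome", extraCapabilities = eCaps)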

library(RSelenium)
library(rvest)
library(tidyverse)

url="https://metro.zakaz.ua/uk/?promotion=1"

rD <- rsDriver(port=4444L, browser="chrome")
remDr <- rD[['client']]

remDr$navigate(url)

### adjust the selectors to the items you want to scrape
src <- remDr$getPageSource()[[1]]

pg <- read_html(src)
tbl <- tibble(
  product_name = pg %>% html_nodes(".product-card-name") %>% html_text(),
  product_info = pg %>% html_nodes(".product-card-info") %>% html_text()
)

## handle pagination (tested with 5 pages) - adjust accordingly
for (i in 2:5) {
  pages <- remDr$findElement(using = 'css selector', str_c(".page:nth-child(", i, ")"))

  pages$clickElement()

  ## wait 5 sec for the page to load
  Sys.sleep(5)

  src <- remDr$getPageSource()[[1]]

  pg <- read_html(src)
  data <- tibble(
    product_name = pg %>% html_nodes(".product-card-name") %>% html_text(),
    product_info = pg %>% html_nodes(".product-card-info") %>% html_text()
  )
  tbl <- tbl %>% bind_rows(data)
}

nrow(tbl)
head(tbl)
tail(tbl)
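One small addition not shown in the answer: once the scrape is finished, close the browser and stop the Selenium server so the port is released.

# Clean up the RSelenium session when done.
remDr$close()
rD$server$stop()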

Here's a quick look at the output:

[Output screenshot]




Answer 2:


Try to adjust your code a little:

from selenium import webdriver

driver = webdriver.Firefox()
current_page = 1
url = "https://metro.zakaz.ua/uk/?promotion=" + str(current_page)

driver.get(url)
# gets all elements with class "page"
pages = driver.find_elements_by_class_name("page")

for i in pages:
    # the page source is refreshed every time a page button has been clicked
    html = driver.page_source
    # put your scraping code here
    # clicks the next page button and continues until there are no more pages
    if int(i.text) == current_page + 1:
        i.click()
        current_page += 1
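Since the question is about R, a rough RSelenium equivalent of the same idea might look like this. It is an untested sketch that reuses the remDr session from the first answer and assumes the pagination buttons carry the class "page":

library(RSelenium)

# Sketch: click through pagination buttons by their label, mirroring the
# Python loop above. remDr is assumed to be an open remoteDriver session
# already navigated to the promotions page.
current_page <- 1
pages <- remDr$findElements(using = "class name", value = "page")

for (btn in pages) {
  html <- remDr$getPageSource()[[1]]   # scrape the currently loaded page here
  label <- suppressWarnings(as.integer(btn$getElementText()[[1]]))
  if (!is.na(label) && label == current_page + 1) {
    btn$clickElement()
    Sys.sleep(5)                       # give React time to render the next page
    current_page <- current_page + 1
  }
}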


Source: https://stackoverflow.com/questions/51536339/scraping-webpage-with-react-js-in-r
