How to scrape a JavaScript-rendered website with R?

Question

Is there a good approach to scraping the website below? https://list.jd.com/list.html?cat=737,794,798&page=1&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main

Basically, I want to get the name and price of all products. However, the price info is stored in some jQuery scripts.

Is Selenium the only solution? I thought of using V8 / jsonlite, but they don't seem to be applicable. It'd be great if you could offer some alternatives in R. (Access to .exe files is blocked on my computer, so I cannot use Selenium / PhantomJS.)

Answer 1:

I couldn't find a robots.txt or any terms and conditions that bar scraping (if someone does find one, please flag it in a comment so I can delete the answer):

library(rvest)
library(V8)
library(tidyverse)

pg <- read_html("https://list.jd.com/list.html?cat=737,794,798&page=1&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main")

Tagging the question with V8 was a 👍🏼 idea.

ctx <- v8()

The page's first script refers to two globals, window and SEARCH, that do not exist in a bare V8 context. We need to define them, then evaluate the JavaScript:

paste0(
  c("var window = {}, SEARCH = {};",
    html_nodes(pg, "script")[[1]] %>%
      html_text()
  ),
  collapse = "\n"
) %>%
  ctx$eval()
## [1] "[object Object]"
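
The same pattern can be tried offline. The following is only a minimal sketch, not the real page: the HTML string, the `pageConfig` field and the `demoList` variable are made-up stand-ins for JD's inline script, chosen so that the script touches `window` and `SEARCH` the way the answer implies.

```r
library(rvest)
library(V8)

# Hypothetical stand-in for the downloaded page: one inline <script> that
# writes to `window` and `SEARCH` and declares a data variable.
html <- '<html><head><script>
window.pageConfig = { cat: [737, 794, 798] };
SEARCH.page = 1;
var demoList = { "1429810": [ { "n": "a", "v": "244_110017" } ] };
</script></head><body></body></html>'

pg <- read_html(html)

# Pull the source text of the first <script> node.
js <- html_text(html_nodes(pg, "script")[[1]])

ctx <- v8()

# Declare the globals the script expects, then run the script itself.
ctx$eval(paste("var window = {}, SEARCH = {};", js, sep = "\n"))

# ctx$get() converts the JavaScript value to an R object via JSON.
demo <- ctx$get("demoList")
str(demo)
```

Without the `var window = {}, SEARCH = {};` prefix, the evaluation would stop with a ReferenceError at the first line that touches an undefined global.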

Now get some data out:

ctx$get("aosList") %>%
  bind_rows(.id = "id") %>%   # one data frame per product id, stacked into one
  as_tibble()                 # tbl_df() is deprecated in current dplyr
## # A tibble: 175 x 3
##    id      n                     v         
##    <chr>   <chr>                 <chr>     
##  1 1429810 39-45英寸             244_110017
##  2 1429810 全高清（1920×1080）   3613_77848
##  3 1429810 3级                   1200_1656 
##  4 4286570 39-45英寸             244_110017
##  5 4286570 高清（1366×768）      3613_93579
##  6 4286570 3级                   1200_1656 
##  7 4609652 55英寸                244_1486  
##  8 4609652 4k超高清（3840×2160） 3613_77847
##  9 4609652 3级                   1200_1656 
## 10 4609660 65英寸                244_58269 
## # ... with 165 more rows
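
Judging by the output above, `ctx$get("aosList")` presumably returns a named list with one small data frame per product id. Here is a minimal sketch of what `bind_rows(.id = "id")` does with such a list. The `aos` list is hand-built from the first rows of the printed output and is an assumption about the shape, not the real return value.

```r
library(dplyr)

# Hypothetical stand-in for ctx$get("aosList"): list names are product ids,
# each element holds that product's attribute rows.
aos <- list(
  "1429810" = data.frame(
    n = c("39-45英寸", "3级"),
    v = c("244_110017", "1200_1656"),
    stringsAsFactors = FALSE
  ),
  "4286570" = data.frame(
    n = "39-45英寸",
    v = "244_110017",
    stringsAsFactors = FALSE
  )
)

# Stack the data frames; the list names become the new `id` column.
flat <- bind_rows(aos, .id = "id") %>%
  as_tibble()

print(flat)
```

Keeping the product id as a column means the result can later be filtered or joined by `id`.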

And more data:

ctx$get("attrList") %>%
  bind_rows(.id = "id") %>%   # product id becomes the `id` column
  as_tibble()                 # tbl_df() is deprecated in current dplyr
## # A tibble: 60 x 15
##    id      IsSam    cw factoryShip isCanUseDQ isJDexpress  isJX isOverseaPurchase mcat3Id soldOS  tssp venderType xgzs 
##    <chr>   <int> <int>       <int>      <int>       <int> <int>             <int>   <int>  <int> <int> <chr>      <chr>
##  1 1429810     0     1           0          0           0     0                 0     798     -1     0 0          7.3  
##  2 4286570     0     1          NA          0           0     0                 0     798     -1     0 0          6.2  
##  3 4609652     0     1          NA          0           0     0                 0     798     -1     0 0          7.5  
##  4 4609660     0     1          NA          0           0     0                 0     798     -1     0 0          8.8  
##  5 4620979     0     1          NA          0           0     0                 0     798     -1     0 0          6.4  
##  6 4751739     0     1          NA          1           0     0                 0     798     -1     0 0          8.9  
##  7 4902977     0     1          NA         NA           0     0                 0     798     -1     0 0          9.5  
##  8 5010925     0     1          NA          1           0     0                 0     798     -1     0 0          8.6  
##  9 5102214     0     1          NA          0           0     0                 0     798     -1     0 0          7.8  
## 10 5218185     0     1          NA          1           0     0                 0     798     -1     0 0          <NA> 
## # ... with 50 more rows, and 2 more variables: isFzxp <int>, shipFareTmplId <int>

Source: https://stackoverflow.com/questions/52534309/how-to-scrape-javascript-rendered-website-by-r

Tags

javascript

rvest

httr