Extract links of subsequent images in div#data-old-hires

Question

With some help, I am able to extract the landing (main) image from a URL. However, I would like to be able to extract the subsequent images as well.

library(rvest)

url <- "https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc=1"

webpage <- read_html(url)

# The data-a-dynamic-image attribute of the main image holds its URLs as quoted strings
r <- webpage %>%
  html_nodes("#landingImage") %>%
  html_attr("data-a-dynamic-image")

# Split on double quotes; the second piece is the first quoted URL
imglink <- strsplit(r, '"')[[1]][2]
print(imglink)

This gives the correct output for the main image. However, I would also like to extract the links that appear when I hover over the other images of the same product. Essentially, the output should contain the following links:

https://images-na.ssl-images-amazon.com/images/I/81bF%2Ba21WLL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/81HVwttGJAL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/81Z1wxLn-uL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/91iKg%2BKqKML.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/91zhpH7%2B8gL.UY500.jpg

Many thanks

Answer 1:

As requested, a Python script is at the bottom. To make this applicable across languages, the answer is in two parts: 1) a high-level pseudocode description of steps that can be carried out in R, Python, or many other languages; 2) a Python example.

An R script that obtains the string (steps 1-3 of the process) is shown after the Python example.

1) Process:

1. Obtain the HTML via a GET request.
2. Use a regex to pull a substring out of one of the script tags; this is the JSON string that jQuery on the page parses to provide the image links.

The regex pattern is

jQuery\.parseJSON\(\'(.*)\'\);

The explanation is:

The contained JSON object is captured starting at the { before "dataInJson" and ending just before the characters '). This extracts the JSON object as a string. The first capturing group (.*) gathers everything between the start string and the end string, excluding both.

3. Only the first match is wanted, so extract it from the matches returned. Pass it to a JSON parsing library that takes a string and returns a JSON object.
4. Loop over that object under the key colorImages (in Python the structure is a dictionary; R will be slightly different) to get the colours of the product, which are in turn used to access the image URLs themselves.

[Figure: colours]

[Figure: nested level for images]
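
To make the capture and lookup steps concrete without touching the network, here is a minimal sketch that runs the same regex on a made-up script snippet. The variable `sample_html`, the colour names, and the example.com URLs are assumptions for illustration only; the real page embeds a far larger object.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the script text on the page.
sample_html = (
    "<script>var obj = jQuery.parseJSON('"
    '{"dataInJson":null,"colorImages":{'
    '"Red":[{"large":"https://example.com/red-1.jpg"},'
    '{"large":"https://example.com/red-2.jpg"}],'
    '"Blue":[{"large":"https://example.com/blue-1.jpg"}]}}'
    "');</script>"
)

# Same pattern as in the answer: capture everything between parseJSON(' and ');
pattern = re.compile(r"jQuery\.parseJSON\('(.*)'\);")
matches = pattern.findall(sample_html)

# Only the first match is wanted; parse it into a dictionary.
data = json.loads(matches[0])

# Each colour maps to a list of image entries; collect the "large" URL of each.
large_links = [
    image["large"]
    for images in data["colorImages"].values()
    for image in images
]
print(large_links)
```

Because `(.*)` is greedy, the capture runs to the last `');` on the line, which is what lets it span the whole JSON object even though the object itself contains quotes.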

2) Those steps shown in Python

import requests  # HTTP client for the GET request
import re  # regular expressions
import json  # parses the captured string into a dictionary

headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'jQuery\.parseJSON\(\'(.*)\'\);')
data = p1.findall(r.text)[0]
json_source = json.loads(data)

for colour in json_source['colorImages']:
    for image in json_source['colorImages'][colour]:
        print(image['large'])

Output:

All the links for the product in all colours, large image links only (so the URLs look slightly different from those in the question and are more numerous, but they point to the same images).
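
One caveat with indexing `findall(...)[0]` directly: if the pattern is not found (for example because the page markup has changed), it raises an `IndexError`. A possible guard, sketched here with a hypothetical helper named `extract_embedded_json` that is not part of the original answer, is to use `re.search` and return `None` when nothing matches:

```python
import json
import re
from typing import Optional

PARSE_JSON_RE = re.compile(r"jQuery\.parseJSON\('(.*)'\);")


def extract_embedded_json(html: str) -> Optional[dict]:
    """Return the first embedded JSON object, or None if the pattern is absent."""
    match = PARSE_JSON_RE.search(html)
    if match is None:
        return None
    # json.loads can still raise json.JSONDecodeError on malformed input.
    return json.loads(match.group(1))


print(extract_embedded_json("<html>no script here</html>"))
print(extract_embedded_json("x = jQuery.parseJSON('{\"colorImages\": {}}');"))
```

The caller can then check for `None` before looping over `colorImages`.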

R script to extract the required string with a regex and parse it as JSON:

library(rvest)
library(jsonlite)
library(stringr)

con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
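# Keep only the script nodes that mention colorImages, then capture the JSON string
res <- page %>%
  html_nodes(xpath = ".//script[contains(., 'colorImages')]") %>%
  html_text() %>%
  str_match("jQuery\\.parseJSON\\(\\'(.*)\\'\\);")

# Column 2 holds the capture group; the second element is the one parsed here
json <- fromJSON(res[, 2][2])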

Update: the page has since been changed, so now just use:

Python:

import requests  # HTTP client for the GET request
import re  # regular expressions

headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'"large":"(.*?)"')
links = p1.findall(r.text)
print(links)

R:

library(rvest)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
res <- page %>%
  html_nodes(xpath=".//script[contains(., 'var data')]")%>%
  html_text() %>% as.character %>% 
  str_match_all(.,'"large":"(.*?)"')
print(res[[1]][,2])
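
The updated pattern relies on the non-greedy `(.*?)`, which stops at the first closing quote so that each "large" value becomes its own match. The sketch below shows the difference on a made-up string; `sample`, the "initial" key, the "hiRes" field, and the example.com URLs are assumptions for illustration, not the actual page content. It also removes repeated links while keeping their order.

```python
import re

# Hypothetical, shortened stand-in for the "var data" script text.
sample = (
    'var data = {"colorImages": {"initial": ['
    '{"hiRes":"https://example.com/a-hi.jpg","large":"https://example.com/a.jpg"},'
    '{"hiRes":null,"large":"https://example.com/b.jpg"},'
    '{"hiRes":null,"large":"https://example.com/a.jpg"}]}}'
)

# Non-greedy: one match per "large" value.
links = re.findall(r'"large":"(.*?)"', sample)

# Greedy, for contrast: a single match running to the last quote in the text.
greedy = re.findall(r'"large":"(.*)"', sample)

# dict.fromkeys keeps the first occurrence of each link in original order.
unique_links = list(dict.fromkeys(links))

print(links)
print(len(greedy))
print(unique_links)
```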

Source: https://stackoverflow.com/questions/55819596/extract-links-of-subsequent-images-in-divdata-old-hires

Tags

rvest