问题
With some help, I am able to extract the landing image/main image of a url. However, I would like to be able to extract the subsequent images as well
require(rvest)
url <-"https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-
Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-
spons&keywords=lunch+bag&psc=1"
webpage <- read_html(url)
r <- webpage %>%
html_nodes("#landingImage") %>%
html_attr("data-a-dynamic-image")
imglink <- strsplit(r, '"')[[1]][2]
print(imglink)
This gives the correct output for the main image. However, I would like to extract the links when I roll-over to the other images of the same product. Essentially, I would like the output to have the following links:
1.https://images-na.ssl-images- amazon.com/images/I/81bF%2Ba21WLL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/81HVwttGJAL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/81Z1wxLn-uL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/91iKg%2BKqKML.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/91zhpH7%2B8gL.UY500.jpg
Many thanks
回答1:
As requested Python script at bottom. In order to make this applicable across languages the answer is in two parts. 1) A high level pseudo code description of steps which can be carried out with R/Python/many other languages 2) A python example.
R script to obtain string shown at end (Steps 1-3 of Process).
1) Process:
- Obtain the html via GET request
- Regex out a substring from one of the script tags which is in fact what jquery on the page uses to provide the image links from json

The regex pattern is
jQuery\.parseJSON\(\'(.*)\'\);
The explanation is:

Basically, the contained json object is gathered starting at the {
before "dataInJson"
and ending before the characters ')
. That extracts this json object as string. The use of 1st Capturing Group (.*) gathers from between the start string and end string (excluding either side).
- The first match is the only one wanted, so out of matches returned the first must be extracted. This is then handled with a json parsing library that can take a string and return a json object
- That json object is looped accessing by key (in the case of Python as structure is a dictionary - R will be slightly different)
colorImages
, to generate the colours (of the product), which are in turn used to access the actual urls themselves.
colours:

nested level for images:

2) Those steps shown in Python
import requests #library to handle xhr GET
import re #library to handle regex
import json
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'jQuery\.parseJSON\(\'(.*)\'\);')
data = p1.findall(r.text)[0]
json_source = json.loads(data)
for colour in json_source['colorImages']:
for image in json_source['colorImages'][colour]:
print(image['large'])
Output:
All the links for the product in all colours - large image links only (so urls appear slightly different and more numerous but are the same images)
R script to regex out required string and generate JSON:
library(rvest)
library( jsonlite)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
page %>%
html_nodes(xpath=".//script[contains(., 'colorImages')]")%>%
html_text() %>% as.character %>% str_match(.,"jQuery\\.parseJSON\\(\\'(.*)\\'\\);") -> res
json = fromJSON(res[,2][2])
They've updated the page so now just use:
Python:
import requests #library to handle xhr GET
import re #library to handle regex
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'"large":"(.*?)"')
links = p1.findall(r.text)
print(links)
R:
library(rvest)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
res <- page %>%
html_nodes(xpath=".//script[contains(., 'var data')]")%>%
html_text() %>% as.character %>%
str_match_all(.,'"large":"(.*?)"')
print(res[[1]][,2])
来源:https://stackoverflow.com/questions/55819596/extract-links-of-subsequent-images-in-divdata-old-hires