extract links of subsequent images in div#data-old-hires

∥☆過路亽.° submitted on 2019-12-13 03:32:41

Question


With some help, I am able to extract the landing (main) image from a URL. However, I would also like to be able to extract the subsequent images.

require(rvest)
url <- "https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-spons&keywords=lunch+bag&psc=1"

webpage <- read_html(url)
r <- webpage %>%
        html_nodes("#landingImage") %>% 
        html_attr("data-a-dynamic-image")
imglink <- strsplit(r, '"')[[1]][2]
print(imglink)

This gives the correct output for the main image. However, I would like to extract the links that appear when I roll over the other images of the same product. Essentially, I would like the output to contain the following links:

  1. https://images-na.ssl-images-amazon.com/images/I/81bF%2Ba21WLL.UY500.jpg
  2. https://images-na.ssl-images-amazon.com/images/I/81HVwttGJAL.UY500.jpg
  3. https://images-na.ssl-images-amazon.com/images/I/81Z1wxLn-uL.UY500.jpg
  4. https://images-na.ssl-images-amazon.com/images/I/91iKg%2BKqKML.UY500.jpg
  5. https://images-na.ssl-images-amazon.com/images/I/91zhpH7%2B8gL.UY500.jpg

Many thanks


Answer 1:


As requested, a Python script is at the bottom. To make this applicable across languages, the answer is in two parts: 1) a high-level pseudocode description of the steps, which can be carried out in R, Python, or many other languages; 2) a Python example.

An R script that obtains the string (steps 1-3 of the process) is shown at the end.

1) Process:

  1. Obtain the HTML via a GET request
  2. Regex out a substring from one of the script tags; this substring is the JSON that jQuery on the page uses to provide the image links

The regex pattern is

jQuery\.parseJSON\(\'(.*)\'\);

The explanation is:

The contained JSON object is captured starting at the { before "dataInJson" and ending just before the characters '). That extracts the JSON object as a string. The first capturing group, (.*), gathers everything between the start string and the end string, excluding both.
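
To see what the capturing group returns, here is a minimal Python sketch run against a made-up script fragment. The `sample` string, its colour name and its example.com URL are assumptions for illustration only; the real page source is much larger and may differ.

```python
import json
import re

# Hypothetical stand-in for the text of the script tag (not the real page source).
sample = (
    "var obj = jQuery.parseJSON('"
    '{"dataInJson":null,"colorImages":{"Blue":[{"large":"https://example.com/a.jpg"}]}}'
    "');\n"
    "var other = 1;"
)

# Same pattern as above: group 1 is everything between parseJSON(' and ');
pattern = re.compile(r"jQuery\.parseJSON\('(.*)'\);")

matches = pattern.findall(sample)   # with one group, findall returns the group text
payload = matches[0]                # only the first match is wanted
parsed = json.loads(payload)        # string -> Python dictionary

print(sorted(parsed))
```

Because `.` does not match a newline by default, the greedy `(.*)` stops at the last `');` on the same line rather than running to the end of the script.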

  3. Only the first match is wanted, so it must be extracted from the matches returned. It is then passed to a JSON parsing library that takes a string and returns a JSON object
  4. That JSON object is looped over by key (in Python the structure is a dictionary; R will be slightly different): colorImages yields the colours of the product, which are in turn used to access the actual URLs themselves.

[Screenshots: the colour keys under colorImages, and the nested level holding the images]
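
The nesting described in step 2 of the second list (colour, then a list of images, then a "large" key) can be sketched with a small hand-written dictionary. The colour names and example.com URLs below are hypothetical; only the colorImages and large keys come from the answer.

```python
# Hypothetical parsed JSON, shaped like the structure described above.
json_source = {
    "colorImages": {
        "Blue": [
            {"large": "https://example.com/blue-1.jpg"},
            {"large": "https://example.com/blue-2.jpg"},
        ],
        "Red": [
            {"large": "https://example.com/red-1.jpg"},
        ],
    }
}

# Outer level: one key per colour. Inner level: a list of image dictionaries.
links_by_colour = {
    colour: [image["large"] for image in images]
    for colour, images in json_source["colorImages"].items()
}

for colour, links in links_by_colour.items():
    print(colour, links)
```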

2) Those steps shown in Python

import requests  # library to handle the HTTP GET request
import re  # library to handle regex
import json

headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'jQuery\.parseJSON\(\'(.*)\'\);')
data = p1.findall(r.text)[0]
json_source = json.loads(data)

for colour in json_source['colorImages']:
    for image in json_source['colorImages'][colour]:
        print(image['large'])

Output:

All the links for the product in all colours, large image links only (so the URLs appear slightly different and are more numerous, but they are the same images).
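
If the same URL turns up under more than one colour, one way to trim the output is to drop exact duplicates while keeping the original order. This is a sketch on a hypothetical `links` list, assuming duplicates are byte-for-byte identical strings; it does not merge different sizes of the same image.

```python
# Hypothetical list of collected links, with one exact repeat.
links = [
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
    "https://example.com/a.jpg",
]

# dict keys are unique and keep insertion order, so this removes repeats in place.
unique_links = list(dict.fromkeys(links))
print(unique_links)
```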


R script to regex out required string and generate JSON:

library(rvest)
library(jsonlite)
library(stringr)

con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
page %>%
  html_nodes(xpath=".//script[contains(., 'colorImages')]")%>%
  html_text() %>% as.character %>% str_match(.,"jQuery\\.parseJSON\\(\\'(.*)\\'\\);") -> res

json = fromJSON(res[,2][2])

Amazon has since updated the page, so now just use:

Python:

import requests  # library to handle the HTTP GET request
import re  # library to handle regex

headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'"large":"(.*?)"')
links = p1.findall(r.text)
print(links)
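
The `?` in `(.*?)` matters here. A small sketch on a made-up string shows the difference between the lazy and greedy forms; the `sample` text and its "other" key are assumptions for illustration, not the real page content.

```python
import re

# Hypothetical fragment resembling a script that holds several "large" entries.
sample = (
    'var data = {"images":['
    '{"large":"https://example.com/1.jpg","other":"x"},'
    '{"large":"https://example.com/2.jpg","other":"y"}]};'
)

# Lazy: stop at the first closing quote, giving one URL per "large" key.
lazy = re.findall(r'"large":"(.*?)"', sample)

# Greedy: run on to the last quote on the line, swallowing everything in between.
greedy = re.findall(r'"large":"(.*)"', sample)

print(lazy)
print(len(greedy))
```

The lazy version returns each URL separately, while the greedy version returns a single over-long match.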

R:

library(rvest)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
res <- page %>%
  html_nodes(xpath=".//script[contains(., 'var data')]")%>%
  html_text() %>% as.character %>% 
  str_match_all(.,'"large":"(.*?)"')
print(res[[1]][,2])


Source: https://stackoverflow.com/questions/55819596/extract-links-of-subsequent-images-in-divdata-old-hires
