Having trouble accessing xpath attribute with scrapy

Question

I am currently trying to scrape the following url: http://www.bedbathandbeyond.com/store/product/dyson-dc59-motorhead-cordless-vacuum/1042997979?categoryId=10562

On this page, I want to extract the number of reviews listed. That is, I want to extract the number 693.

This is my current xpath:

sel.xpath('//*[@id="BVRRRatingSummaryLinkReadID"]/a/span/span')

It only returns an empty list. Can someone suggest a correct XPath?

Answer 1:

There are no reviews in the initial page that Scrapy downloads. The reviews are loaded and built by JavaScript after the page loads, so no XPath run against the raw HTML can match them.

Basically, your options are:

- a high-level approach: use a real browser, for example with Selenium. You can even combine Scrapy and Selenium:
  - selenium with scrapy for dynamic page
  - Scraping with Scrapy and Selenium
  - scrapy-webdriver
- a middle-level approach: Scrapy + scrapyjs
- a low-level approach: find out where the reviews are constructed and get them from there

Here is a working example of the low-level approach. It requests the JavaScript file that builds the reviews, parses it with slimit, loads the embedded object with json, and parses the HTML it contains with BeautifulSoup:

import json

from bs4 import BeautifulSoup
import requests
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

ID = 1042997979

url = 'http://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/{id}/reviews.djs?format=embeddedhtml&sort=submissionTime'.format(id=ID)

response = requests.get(url)

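# Parse the JavaScript response and walk its syntax tree
parser = Parser()
tree = parser.parse(response.text)
data = {}
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Object):
        # Re-serialize each object literal and load it as JSON
        data = json.loads(node.to_ecma())
        if "BVRRSourceID" in data:
            break

# The value under "BVRRSourceID" is an HTML fragment holding the review summary
soup = BeautifulSoup(data['BVRRSourceID'], 'html.parser')
print(soup.select('span.BVRRCount span.BVRRNumber')[0].text)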

Prints 693.

To adapt the solution to Scrapy, you would need to make a request with Scrapy instead of requests, and parse the HTML with Scrapy instead of BeautifulSoup.
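As a rough sketch of what the parsing step could look like without requests, slimit or BeautifulSoup, the helper below uses only the standard library, so a Scrapy callback could call it with response.text. It swaps in a different technique from the answer: instead of walking a JavaScript syntax tree, it tries json's raw_decode at each opening brace. That only works if the object literal holding "BVRRSourceID" is valid JSON, which is an assumption here. The function names, the ReviewCountParser class and the sample payload are all hypothetical and only mimic the shape the code above relies on.

```python
import json
from html.parser import HTMLParser


def find_json_object(js_text, required_key):
    """Return the first JSON-decodable object in js_text that has required_key."""
    decoder = json.JSONDecoder()
    pos = js_text.find("{")
    while pos != -1:
        try:
            obj, _ = decoder.raw_decode(js_text, pos)
        except ValueError:
            obj = None  # not valid JSON at this brace; try the next one
        if isinstance(obj, dict) and required_key in obj:
            return obj
        pos = js_text.find("{", pos + 1)
    return None


class ReviewCountParser(HTMLParser):
    """Collect text of span.BVRRNumber elements nested inside span.BVRRCount."""

    def __init__(self):
        super().__init__()
        self.stack = []    # class lists of the currently open <span> tags
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.stack.append((dict(attrs).get("class") or "").split())

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        inside_number = bool(self.stack) and "BVRRNumber" in self.stack[-1]
        inside_count = any("BVRRCount" in c for c in self.stack[:-1])
        if inside_number and inside_count and data.strip():
            self.numbers.append(data.strip())


def extract_review_count(js_text):
    """Return the review count as a string, or None if it cannot be found."""
    data = find_json_object(js_text, "BVRRSourceID")
    if data is None:
        return None
    parser = ReviewCountParser()
    parser.feed(data["BVRRSourceID"])
    return parser.numbers[0] if parser.numbers else None


# Hypothetical payload, not the real reviews.djs content.
sample_js = (
    'var materials = {"BVRRSourceID": "<div><span class=\\"BVRRCount\\">'
    '<span class=\\"BVRRNumber\\">693</span> reviews</span></div>"};'
)
print(extract_review_count(sample_js))
```

In a spider, the callback for the reviews.djs request would pass response.text to extract_review_count. If the real payload is not valid JSON at that point, a JavaScript parser such as slimit is still needed.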

Answer 2:

You cannot do that with the HTML of this URL alone. If you only fetch the HTML, you won't find the string 693 anywhere in it. This content must be created dynamically by some AJAX code.
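A small illustration of why the XPath from the question comes back empty. The two markup strings are hypothetical: the first stands in for a static fetch where the review container is still an empty placeholder, the second for the DOM after the script has filled it in. The standard library's ElementTree is used here only to keep the sketch self-contained. It supports a limited XPath subset and needs well-formed markup, so it is not a substitute for Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Same path as in the question, written relative to the document root.
XPATH = ".//*[@id='BVRRRatingSummaryLinkReadID']/a/span/span"

# Hypothetical static HTML: the container exists but has no children yet.
static_html = (
    "<html><body><div id='BVRRRatingSummaryLinkReadID'></div></body></html>"
)

# Hypothetical DOM after JavaScript has inserted the review summary.
rendered_html = (
    "<html><body><div id='BVRRRatingSummaryLinkReadID'>"
    "<a><span><span>693</span></span></a></div></body></html>"
)


def query(markup):
    """Return the text of every element matched by XPATH in markup."""
    return [el.text for el in ET.fromstring(markup).findall(XPATH)]


print(query(static_html))    # []
print(query(rendered_html))  # ['693']
```

The expression itself is fine. It has nothing to match until the script has run, which a plain HTTP fetch never does.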

Source: https://stackoverflow.com/questions/27426768/having-trouble-accessing-xpath-attribute-with-scrapy

Tags

python

xpath

web-scraping

html-parsing

scrapy