Scrapy - Crawl and Scrape a website

让人想犯罪 __ 提交于 2020-01-01 18:54:15

问题


As a part of learning to use Scrapy, I have tried to Crawl Amazon and there is a problem while scraping data,

The output of my code is as follows:

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
    {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13',
              u'http://www.amazon.com/MELT-Method-Breakthrough-Self-Treatment-Eliminate/dp/0062065351/ref=sr_1_14?s=books&ie=UTF8&qid=1361774694&sr=1-14',
              u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15',
              u'http://www.amazon.com/Inferno-Robert-Langdon-Dan-Brown/dp/0385537859/ref=sr_1_16?s=books&ie=UTF8&qid=1361774694&sr=1-16',
              u'http://www.amazon.com/Memory-Light-Wheel-Time/dp/0765325950/ref=sr_1_17?s=books&ie=UTF8&qid=1361774694&sr=1-17',
              u'http://www.amazon.com/Jesus-Calling-Enjoying-Peace-Presence/dp/1591451884/ref=sr_1_18?s=books&ie=UTF8&qid=1361774694&sr=1-18',
              u'http://www.amazon.com/Fifty-Shades-Grey-Book-Trilogy/dp/0345803485/ref=sr_1_19?s=books&ie=UTF8&qid=1361774694&sr=1-19',
              u'http://www.amazon.com/Fifty-Shades-Trilogy-Darker-3-/dp/034580404X/ref=sr_1_20?s=books&ie=UTF8&qid=1361774694&sr=1-20',
              u'http://www.amazon.com/Wheat-Belly-Lose-Weight-Health/dp/1609611543/ref=sr_1_21?s=books&ie=UTF8&qid=1361774694&sr=1-21',
              u'http://www.amazon.com/Publication-Manual-American-Psychological-Association/dp/1433805618/ref=sr_1_22?s=books&ie=UTF8&qid=1361774694&sr=1-22',
              u'http://www.amazon.com/One-Only-Ivan-Katherine-Applegate/dp/0061992259/ref=sr_1_23?s=books&ie=UTF8&qid=1361774694&sr=1-23',
              u'http://www.amazon.com/Inquebrantable-Spanish-Jenni-Rivera/dp/1476745420/ref=sr_1_24?s=books&ie=UTF8&qid=1361774694&sr=1-24'],
     'title': [u'ObamaCare Survival Guide',
               u'The Official SAT Study Guide, 2nd edition',
               u'Inferno: A Novel (Robert Langdon)',
               u'A Memory of Light (Wheel of Time)',
               u'Jesus Calling: Enjoying Peace in His Presence',
               u'Fifty Shades of Grey: Book One of the Fifty Shades Trilogy',
               u'Fifty Shades Trilogy: Fifty Shades of Grey, Fifty Shades Darker, Fifty Shades Freed 3-volume Boxed Set',
               u'Wheat Belly: Lose the Wheat, Lose the Weight, and Find Your Path Back to Health',
               u'Publication Manual of the American Psychological Association, 6th Edition',
               u'The One and Only Ivan',
               u'Inquebrantable (Spanish Edition)'],
     'visit_id': '2f4d045a9d6013ef4a7cbc6ed62dc111f6111633',
     'visit_status': 'new'}

But, I wanted the output to be captured like this,

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
    {'link': [u'http://www.amazon.com/ObamaCare-Survival-Guide-Nick-Tate/dp/0893348627/ref=sr_1_13?s=books&ie=UTF8&qid=1361774694&sr=1-13'],
     'title': [u'ObamaCare Survival Guide']}

2013-02-25 12:47:21+0530 [scanon] DEBUG: Scraped from <200 http://www.amazon.com/s/ref=sr_pg_2?ie=UTF8&page=2&qid=1361774681&rh=n%3A283155>
    {'link': [u'http://www.amazon.com/Official-SAT-Study-Guide-2nd/dp/0874478529/ref=sr_1_15?s=books&ie=UTF8&qid=1361774694&sr=1-15'],
     'title': [u'The Official SAT Study Guide, 2nd edition']}

I think its not a problem with the scrapy or the crawler, but with the FOR loop written.

Following is the code,

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Amaze.items import AmazeItem

class AmazeSpider2(CrawlSpider):
    name = "scanon"
    allowed_domains = ["www.amazon.com"]
    start_urls = ["http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=books"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*")), callback="parse_items_1", follow= True),
        )

    def parse_items_1(self, response):
        items = []
        print ('*** response:', response.url)
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//h3')
        for title in titles:
            item = AmazeItem()
            item["title"] = title.select('//a[@class="title"]/text()').extract()
            item["link"] = title.select('//a[@class="title"]/@href').extract()
            print ('**parse-items_1:', item["title"], item["link"])
            items.append(item)
        return items

Any assistance!


回答1:


problem is in your Xpath

def parse_items_1(self, response):
        items = []
        print ('*** response:', response.url)
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//h3')
        for title in titles:
            item = AmazeItem()
            item["title"] = title.select('.//a[@class="title"]/text()').extract()
            item["link"] = title.select('.//a[@class="title"]/@href').extract()
            print ('**parse-items_1:', item["title"], item["link"])
            items.append(item)
        return items

in above Xpaths you needs to use . in xpath to look into title only other wise your xpath will look on whole page , so it will get allot of matches and will return them,




回答2:


By the way - you can test our your Xpath expressions in the Scrapy Shell - http://doc.scrapy.org/en/latest/topics/shell.html

Done right, it will save you hours of work and a headache. :)




回答3:


Use yield to make a generator and fix your xpath selectors:

def parse_items_1(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//h3')

    for title in titles:
        item = AmazeItem()
        item["title"] = title.select('.//a[@class="title"]/text()').extract()
        item["link"] = title.select('.//a[@class="title"]/@href').extract()

        yield item


来源:https://stackoverflow.com/questions/15061702/scrapy-crawl-and-scrape-a-website

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!