Scrapy: scraping a list of links

Submitted by ↘锁芯ラ on 2019-12-01 07:28:19

Question


This question is somewhat of a follow-up to a question I asked previously.

I am trying to scrape a website which contains some links on the first page. Something similar to this.

Now, since I want to scrape the details of the items present on the page, I have extracted their individual URLs.

I have saved these URLs in a list.

How do I launch spiders to scrape the pages individually?

For better understanding:

[urlA, urlB, urlC, urlD...]

This is the list of URLs I have scraped. Now I want to launch a spider to scrape each of these links individually.

How do I go about this?


Answer 1:


I'm assuming that the URLs you want to follow lead to pages with the same or a similar structure. If that's the case, you should do something like this:

from scrapy.spiders import CrawlSpider
from scrapy.http import Request

class YourCrawler(CrawlSpider):

    name = 'yourCrawler'
    allowed_domains = ['domain.com']
    start_urls = ["http://www.domain.com/example/url"]

    def parse(self, response):
        # Parse any elements you need from the start_urls and, optionally,
        # store them as Items.
        # See http://doc.scrapy.org/en/latest/topics/items.html
        urls = response.xpath('//div[@id="example"]//a/@href').extract()
        for url in urls:
            # urljoin resolves relative hrefs against the current page;
            # dont_filter bypasses Scrapy's duplicate-request filter
            yield Request(response.urljoin(url),
                          callback=self.parse_following_urls,
                          dont_filter=True)

    def parse_following_urls(self, response):
        # Parsing rules for the followed pages go here
        pass
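
For instance, assuming each followed page has its title in an h1 element (the XPath and field names below are illustrative, not part of the original answer), the callback could yield one item per page:

    def parse_following_urls(self, response):
        # Hypothetical example: extract the page title and yield it as an item
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),
        }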

Otherwise, if the URLs you want to follow lead to pages with different structures, you can define a specific callback method for each of them (something like parse1, parse2, parse3, ...).
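
Here is a minimal sketch of that approach using a plain scrapy.Spider; the domain, URL patterns, XPaths, and field names are placeholders for illustration, not from the original answer:

import scrapy

class MultiPageCrawler(scrapy.Spider):
    name = 'multiPageCrawler'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/example/url']

    def parse(self, response):
        for href in response.xpath('//div[@id="example"]//a/@href').extract():
            url = response.urljoin(href)
            # Route each link to a callback chosen by its URL pattern.
            # The '/products/' and '/reviews/' patterns are made up for illustration.
            if '/products/' in url:
                yield scrapy.Request(url, callback=self.parse_product)
            elif '/reviews/' in url:
                yield scrapy.Request(url, callback=self.parse_review)

    def parse_product(self, response):
        # Parsing rules for product pages (hypothetical XPath and fields)
        yield {'url': response.url,
               'name': response.xpath('//h1/text()').extract_first()}

    def parse_review(self, response):
        # Parsing rules for review pages (hypothetical XPath and fields)
        yield {'url': response.url,
               'rating': response.xpath('//span[@class="rating"]/text()').extract_first()}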



Source: https://stackoverflow.com/questions/27984064/scrapy-scraping-a-list-of-links
