I am writing a crawler to get the names of items from a website. The website has 25 items per page and multiple pages (200 pages for some item types).
Here is the code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from lonelyplanet.items import LonelyplanetItem

class LonelyplanetSpider(CrawlSpider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]

    def start_requests(self):
        # request the first 8 listing pages (page=0 .. page=7)
        for i in xrange(8):
            yield self.make_requests_from_url(
                "http://www.lonelyplanet.com/europe/sights?page=%d" % i)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//h2')
        items = []
        for site in sites:
            item = LonelyplanetItem()
            item['name'] = site.select('a[@class="targetUrl"]/text()').extract()
            items.append(item)
        return items
When I run the crawler and store the data in CSV format, the data is not stored in order: page 2's data is stored before page 1's, or page 3's before page 2's, and so on. Also, sometimes before all the data of one page is stored, data from another page comes in, and then the rest of the former page's data is stored.
Scrapy is an asynchronous framework. It uses non-blocking IO, so it doesn't wait for a request to finish before starting the next one.
And since multiple requests are in flight at once, there is no way to know the exact order in which the parse() method will receive the responses.
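For example, you can make the responses arrive in page order by limiting Scrapy to one request at a time and using request priorities. A minimal sketch, assuming a Scrapy version that supports per-spider custom_settings (Scrapy 1.0+; on older versions you would set CONCURRENT_REQUESTS = 1 in settings.py instead):

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider

class LonelyplanetSpider(CrawlSpider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]

    # only one request in flight at a time (assumes Scrapy >= 1.0;
    # otherwise put CONCURRENT_REQUESTS = 1 in settings.py)
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    def start_requests(self):
        for i in xrange(8):
            # higher-priority requests are dequeued first, so give
            # earlier pages a higher priority value
            yield Request(
                "http://www.lonelyplanet.com/europe/sights?page=%d" % i,
                priority=8 - i)

    # parse() stays exactly as in the question

With a single request in flight and priorities forcing page order, parse() receives the responses in order, at the cost of crawling speed.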
My point is, Scrapy is not meant to extract data in a particular order. If you absolutely need to preserve order, there are some ideas here: Scrapy Crawl URLs in Order. One of them is sketched below.
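The idea is to request only one page at a time and have each callback request the next page, so the crawl is serialized without touching global settings. A sketch, where the spider name and the 'page' key in request.meta are illustrative, not from the original code:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from lonelyplanet.items import LonelyplanetItem

class OrderedLonelyplanetSpider(CrawlSpider):
    name = "lonelyplanet_ordered_spider"
    allowed_domains = ["lonelyplanet.com"]
    page_url = "http://www.lonelyplanet.com/europe/sights?page=%d"

    def start_requests(self):
        # request only the first page; each following page is requested
        # from parse() after the current page has been fully processed
        yield Request(self.page_url % 0, meta={'page': 0})

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//h2'):
            item = LonelyplanetItem()
            item['name'] = site.select('a[@class="targetUrl"]/text()').extract()
            yield item
        # only after all items of this page are emitted, ask for the next page
        page = response.meta['page']
        if page < 7:  # 8 pages in total, as in the question
            yield Request(self.page_url % (page + 1), meta={'page': page + 1})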
Source: https://stackoverflow.com/questions/11049088/scrapy-not-crawling-subsequent-pages-in-order