Scrapy: scrape all pages that have this syntax

Posted by 旧街凉风 on 2019-12-21 06:57:39

Question


I want to use Scrapy to scrape all pages that have this syntax:

mywebsite/?page=INTEGER

I tried this:

start_urls = ['MyWebsite']
rules = [Rule(SgmlLinkExtractor(allow=['/\?page=\d+']), 'parse')]

but it seems that the spider still only visits MyWebsite itself. What should I do to make it request the /?page=NumberOfPage URLs as well?

Edit

I mean that I want to scrape these pages:

mywebsite/?page=1
mywebsite/?page=2
mywebsite/?page=3
mywebsite/?page=4
mywebsite/?page=5
..
..
..
mywebsite/?page=7677654

My code:

start_urls = [
    'http://example.com/?page=%s' % page for page in xrange(1, 100000)
]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('my xpath')
    for site in sites:
        DateDifference = site.xpath('xpath for date difference').extract()[0]
        if DateDifference.days < 8:
            yield Request(Link, meta={'date': Date}, callback=self.crawl)

I want to get all the data from pages that were added in the last 7 days. I don't know how many pages were added in that time, so I thought I could crawl a large number of pages, say 100000, and check the date difference for each one: if it is less than 7 days I want to yield the request, otherwise I want to stop crawling entirely.
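Note that `.extract()[0]` returns a string, not a timedelta, so `DateDifference.days` will fail; the extracted date text has to be parsed first. A minimal sketch of that check (the date format here is an assumption, adjust it to whatever the site actually prints):

```python
from datetime import datetime, timedelta

def is_recent(date_text, days=7, fmt="%Y-%m-%d"):
    """Return True if date_text (e.g. '2014-01-15') falls within the last `days` days."""
    posted = datetime.strptime(date_text, fmt)
    return datetime.now() - posted < timedelta(days=days)
```

A real spider would call `is_recent(site.xpath('...').extract()[0])` instead of reading `.days` off the raw string.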


Answer 1:


If I understand correctly, you want to crawl all pages that are younger than 7 days. One way to do it is to follow the pages in order (assuming page no. 1 is the youngest, no. 2 is older than no. 1, no. 3 older than no. 2, and so on).

You can do something like this:

start_urls = ['mywebsite/?page=1']

def parse(self, response):
    sel = Selector(response)
    DateDifference = sel.xpath('xpath for date difference').extract()[0]

    i = response.meta.get('index', 1)

    if DateDifference.days < 8:
        yield Request(Link, meta={'date': Date}, callback=self.crawl)
        i += 1
        yield Request('mywebsite/?page=' + str(i), meta={'index': i}, callback=self.parse)

The idea is to execute parse sequentially. On the first call, response.meta['index'] isn't defined, so the index defaults to 1. On each subsequent call, after another page has already been parsed, response.meta['index'] holds the number of the page currently being scraped.
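The control flow above can be simulated without Scrapy at all; this framework-free sketch uses a stand-in `fetch` function (an assumption, not a real API) that returns each page's age in days, and stops at the first page older than the cutoff:

```python
def sequential_crawl(fetch, max_age_days=7):
    """Sketch of the sequential idea: fetch(page_number) returns that page's
    age in days; follow page 1, 2, 3, ... and stop at the first page that is
    older than max_age_days."""
    page = 1
    scraped = []
    while True:
        age = fetch(page)
        if age > max_age_days:
            break              # everything after this page is older still
        scraped.append(page)   # a real spider would yield item Requests here
        page += 1
    return scraped

# Example: pages 1-4 are 2, 5, 9 and 12 days old -> only pages 1 and 2 qualify
ages = {1: 2, 2: 5, 3: 9, 4: 12}
print(sequential_crawl(ages.__getitem__))  # [1, 2]
```

The key property, as in the answer, is that no request for page n+1 is ever issued once page n turns out to be too old.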




Answer 2:


A CrawlSpider with rules will not help in this case. Rules are used to extract links that match your patterns from the pages already crawled. Obviously your start URL's page doesn't contain links to all those pages, which is why you don't get them.

Something like this should work:

class MyWebsiteSpider(Spider):
    ...

    def start_requests(self):
        for i in xrange(1, 7677654 + 1):
            yield self.make_requests_from_url('mywebsite/?page=%d' % i)
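Since start_requests is itself a generator, the URLs can be produced lazily, one at a time, so millions of strings are never held in memory at once. A framework-free sketch of just the URL generation (the base URL is a placeholder):

```python
def page_urls(base, first=1, last=7677654):
    """Lazily generate 'base/?page=N' for N = first..last, one URL at a time,
    mirroring what a start_requests generator would yield."""
    for i in range(first, last + 1):
        yield "%s/?page=%d" % (base, i)

print(list(page_urls("http://mywebsite", last=3)))
# ['http://mywebsite/?page=1', 'http://mywebsite/?page=2', 'http://mywebsite/?page=3']
```

Note that this brute-force approach still requests every page; combining it with a date check (as the asker describes) or with answer 1's sequential scheme avoids fetching pages that are already known to be too old.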


Source: https://stackoverflow.com/questions/21170777/scrapy-scrap-on-all-pages-that-have-this-syntax
