Scrapy: How to stop requesting in case of 302?

Submitted by 流过昼夜 on 2021-02-08 02:58:29

Question


I am using Scrapy 2.4 to crawl specific pages from a start_urls list. Each of those URLs presumably has 6 result pages, so I request them all.

In some cases, however, there is only 1 result page, and all further paginated pages return a 302 to pn=1. In that case I do not want to follow the 302, nor do I want to keep looking for pages 3, 4, 5 and 6, but rather move on to the next URL in the list.

How can I exit (continue) this for loop in case of a 302/301, and how can I avoid following that 302?

def start_requests(self):
    for url in self.start_urls:
        for i in range(1, 7):  # pages 1 to 6
            yield scrapy.Request(
                url=url + f'&pn={i}'
            )

def parse(self, response):

    # parse page
    ...

    # recognize no pagination and somehow exit the for loop
    if not response.xpath('//regex'): 
        # ... continue somehow instead of going to page 2
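For the "not follow that 302" half specifically, Scrapy's RedirectMiddleware honours a dont_redirect flag in Request.meta, and the handle_httpstatus_list meta key lets a 301/302 response reach the callback instead of being filtered out. A minimal sketch against the request above (these meta keys are standard Scrapy; combining them here is only an illustration, not a tested fix):

yield scrapy.Request(
    url=url + f'&pn={i}',
    # do not follow 301/302; hand the redirect response to parse() instead
    meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
)

In parse, such a response can then be recognised via response.status in (301, 302) and skipped. On its own this only stops the redirect from being followed; pages 3-6 are still scheduled, which is what the answer below avoids.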

Answer 1:


The main problem with your approach is that in start_requests we can't know ahead of time how many valid pages exist.

The common approach for this type of case is to schedule the requests one by one, as shown below, instead of in a loop:

class somespider(scrapy.Spider):
    ...
    def start_requests(self):
        ...
        for u in self.start_urls:
            # schedule only the first page of each "query"
            yield scrapy.Request(url=u + '&pn=1', callback=self.parse)

    def parse(self, response):
        r_url, page_number = response.url.split("&pn=")
        page_number = int(page_number)
        ...
        if next_page_exists:
            yield scrapy.Request(
                url=f'{r_url}&pn={page_number + 1}',
                callback=self.parse)
        else:
            # something else
            ...
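This covers the "stop requesting" half. To also avoid following the 302 itself, the same sequential scheme can be combined with the dont_redirect / handle_httpstatus_list meta keys, so the redirect response reaches parse() and simply ends the pagination there. A minimal sketch under those assumptions (the URL scheme, the 6-page limit and the names are taken from the question; the spider is illustrative, not a tested implementation):

import scrapy

class somespider(scrapy.Spider):
    name = 'somespider'
    start_urls = [...]  # the original list of search URLs goes here

    # dont_redirect: RedirectMiddleware will not follow 301/302;
    # handle_httpstatus_list: those responses are passed to the callback
    NO_REDIRECT = {'dont_redirect': True, 'handle_httpstatus_list': [301, 302]}

    def start_requests(self):
        for u in self.start_urls:
            # schedule only the first page of each query
            yield scrapy.Request(url=u + '&pn=1', callback=self.parse,
                                 meta=self.NO_REDIRECT)

    def parse(self, response):
        if response.status in (301, 302):
            # this page redirected back to pn=1: pagination has ended,
            # so nothing further is scheduled for this query
            return

        r_url, page_number = response.url.split("&pn=")
        page_number = int(page_number)

        # ... parse the page / yield items here ...

        if page_number < 6:  # at most 6 result pages per query
            yield scrapy.Request(url=f'{r_url}&pn={page_number + 1}',
                                 callback=self.parse,
                                 meta=self.NO_REDIRECT)

Keeping the meta flags on every paginated request matters: without handle_httpstatus_list the 302 response would be dropped by HttpErrorMiddleware before parse() could see it.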


Source: https://stackoverflow.com/questions/65427519/scrapy-how-to-stop-requesting-in-case-of-302
