Question:
I am using Scrapy 2.4 to crawl specific pages from a start_urls list. Each of those URLs presumably has 6 result pages, so I request them all.
In some cases, however, there is only 1 result page, and every other paginated page returns a 302 to pn=1. In that case I do not want to follow the 302, nor do I want to keep requesting pages 3, 4, 5, 6; I would rather move on to the next URL in the list.
How do I exit (continue) this for loop when I hit a 302/301, and how do I avoid following that 302?
def start_requests(self):
    for url in self.start_urls:
        for i in range(1, 7):  # 6 pages
            yield scrapy.Request(
                url=url + f'&pn={i}'
            )

def parse(self, response):
    # parse page
    ...
    # recognize there is no pagination and somehow exit the for loop
    if not response.xpath('//regex'):
        # ... continue somehow instead of going to page 2
Answer 1:
The main problem with your approach is that, from start_requests, we can't know ahead of time how many valid pages exist.
The common approach for this type of case is to schedule requests one by one instead of in a loop: each parsed page decides whether to request the next one.
class SomeSpider(scrapy.Spider):
    ...
    def start_requests(self):
        ...
        for u in self.start_urls:
            # schedule only the first page of each "query"
            yield scrapy.Request(url=u + '&pn=1', callback=self.parse)

    def parse(self, response):
        r_url, page_number = response.url.split("&pn=")
        page_number = int(page_number)
        ...
        if next_page_exists:  # e.g. the page contains pagination/results
            yield scrapy.Request(
                url=f'{r_url}&pn={page_number + 1}',
                callback=self.parse)
        else:
            # something else
            ...
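
As for the other half of the question, not following the 302 at all: Scrapy's request meta keys can handle that. Setting dont_redirect tells RedirectMiddleware not to follow the redirect, and handle_httpstatus_list lets the 301/302 response reach your callback so you can stop there. Below is a minimal sketch combining this with the one-request-at-a-time scheduling above; the spider name, start URL, page_request helper, and the hard-coded 6-page cap are placeholder assumptions, not part of the original answer:

import scrapy

class PagedSpider(scrapy.Spider):  # hypothetical spider for illustration
    name = 'paged'
    start_urls = ['https://example.com/search?q=foo']  # placeholder

    def start_requests(self):
        for u in self.start_urls:
            # schedule only page 1; parse() decides whether to continue
            yield self.page_request(u + '&pn=1')

    def page_request(self, url):
        # helper of our own (not Scrapy API) that disables redirect-following
        return scrapy.Request(
            url=url,
            callback=self.parse,
            meta={'dont_redirect': True,                  # don't follow 301/302
                  'handle_httpstatus_list': [301, 302]},  # deliver them to parse()
        )

    def parse(self, response):
        if response.status in (301, 302):
            # the paginated page redirected back to pn=1: no more result
            # pages exist, so stop scheduling this query here
            return
        r_url, page_number = response.url.split('&pn=')
        page_number = int(page_number)
        # ... parse the items on this page ...
        if page_number < 6:  # upper bound from the question
            yield self.page_request(f'{r_url}&pn={page_number + 1}')

Setting REDIRECT_ENABLED = False in settings.py would disable redirect-following spider-wide; the per-request meta shown here keeps the change scoped to just these pagination requests.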
Source: https://stackoverflow.com/questions/65427519/scrapy-how-to-stop-requesting-in-case-of-302