Question
I am trying to crawl a defined list of URLs with Scrapy 2.4 where each of those URLs can have up to 5 paginated URLs that I want to follow.
Even though the system works, I make one extra request that I want to get rid of:
Those pages are exactly the same but have a different URL:
example.html
example.html?pn=1
Somewhere in my code I make this extra request and I cannot figure out how to suppress it.
This is the working code:
Define a bunch of URLs to scrape:
start_urls = [
    'https://example...',
    'https://example2...',
]
Start requesting all start URLs:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse,
        )
Parse the start URL:
def parse(self, response):
    # request the first paginated page for this start URL
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(pn=1, base_url=response.url))
Fetch all paginated URLs from the start URLs:
def parse_item(self, response, pn, base_url):
    self.logger.info('Parsing %s', response.url)
    if pn < 6:  # maximum level 5
        url = base_url + '&pn=' + str(pn + 1)
        yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, pn=pn + 1))
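The extra request comes from the start URL itself: start_requests fetches the bare URL, and parse() then requests the same page again as &pn=1. If the chained, one-page-at-a-time pagination should be kept, one option (a sketch only, reusing the existing parse_item unchanged) is to drop the intermediate parse() step and request the first paginated page directly from start_requests:

def start_requests(self):
    for url in self.start_urls:
        # go straight to pn=1 so the bare start URL is never fetched
        yield scrapy.Request(
            url + '&pn=1',
            callback=self.parse_item,
            cb_kwargs=dict(pn=1, base_url=url),
        )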
Answer 1:
If I understand your question correctly, you just need to start at ?pn=1 and skip the request without the pn parameter. Here's an option for how I would do it, which also only requires one parse method.
start_urls = [
    'https://example...',
    'https://example2...',
]
def start_requests(self):
    for url in self.start_urls:
        # how many pages to crawl
        for i in range(1, 6):
            yield scrapy.Request(
                url=url + f'&pn={i}'
            )
def parse(self, response):
    self.logger.info('Parsing %s', response.url)
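Put together, a minimal self-contained version of this approach could look like the sketch below. The spider name and start_urls are placeholders, and whether the page parameter is appended with & or ? depends on whether the base URL already carries a query string:

import scrapy

class PaginationSpider(scrapy.Spider):
    name = 'pagination_example'  # placeholder name
    start_urls = [
        'https://example...',
        'https://example2...',
    ]

    def start_requests(self):
        for url in self.start_urls:
            # request only the paginated variants pn=1 .. pn=5,
            # so the bare URL without pn is never fetched
            for i in range(1, 6):
                yield scrapy.Request(url=url + f'&pn={i}')

    def parse(self, response):
        # requests without an explicit callback land here
        self.logger.info('Parsing %s', response.url)
        # extract and yield items from each paginated page here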
Source: https://stackoverflow.com/questions/65423455/how-to-yield-in-scrapy-without-a-request