How to yield in Scrapy without a request?


Question


I am trying to crawl a defined list of URLs with Scrapy 2.4, where each of those URLs can have up to 5 paginated URLs that I want to follow.

While the system works, there is one extra request that I want to get rid of:

Those pages are exactly the same but have different URLs:

example.html
example.html?pn=1

Somewhere in my code I make this extra request, and I cannot figure out how to suppress it.

This is the working code:

Define a bunch of URLs to scrape:

start_urls = [
    'https://example...',
    'https://example2...',
]

Start requesting all start URLs:

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url = url,
            callback=self.parse,
        )

Parse the start URL:

def parse(self, response):
    url = response.url + '&pn=' + str(1)
    yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(pn=1, base_url=response.url))

Follow all paginated URLs from the start URLs:

def parse_item(self, response, pn, base_url):
    self.logger.info('Parsing %s', response.url)
    if pn < 6:  # maximum level 5
        url = base_url + '&pn=' + str(pn + 1)
        yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, pn=pn + 1))
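
Put together, the snippets above correspond roughly to the spider below. The class name, spider name, and start URLs are placeholders I have assumed (the start URLs are given a query string so that appending '&pn=...' as in the snippets yields a valid URL); only the request flow mirrors the code above. It also shows where the duplicate fetch happens: each bare start URL is downloaded once by start_requests and then, in effect, again as &pn=1.

import scrapy


class ExampleSpider(scrapy.Spider):
    # Placeholder class and spider name; only the request flow mirrors the snippets above.
    name = 'example'

    # Placeholder start URLs; they already carry a query string so that
    # appending '&pn=...' (as in the snippets) produces a valid URL.
    start_urls = [
        'https://example.com/list?cat=1',
        'https://example.com/list?cat=2',
    ]

    def start_requests(self):
        # Fetch every bare start URL once; this page renders the same
        # content as '&pn=1', which is the extra request in question.
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Start pagination at pn=1.
        url = response.url + '&pn=' + str(1)
        yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(pn=1, base_url=response.url))

    def parse_item(self, response, pn, base_url):
        self.logger.info('Parsing %s', response.url)
        if pn < 6:  # maximum level 5
            url = base_url + '&pn=' + str(pn + 1)
            yield scrapy.Request(url, self.parse_item, cb_kwargs=dict(base_url=base_url, pn=pn + 1))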

Answer 1:


If I understand your question correctly, you just need to start at ?pn=1 and skip the URL without any pn parameter. Here's how I would do it; it also only requires one parse method.

start_urls = [
    'https://example...',
    'https://example2...',
]

def start_requests(self):
    for url in self.start_urls:
        # how many pages to crawl
        for i in range(1, 6):
            yield scrapy.Request(
                url=url + f'&pn={i}'
            )

def parse(self, response):
    self.logger.info('Parsing %s', response.url) 
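
For completeness, a minimal runnable sketch of that approach (the spider name and start URLs are placeholders, and the start URLs are assumed to already contain a query string so that '&pn=' concatenates cleanly):

import scrapy


class PaginationSpider(scrapy.Spider):
    # Placeholder spider name and URLs, used only for illustration.
    name = 'pagination'

    start_urls = [
        'https://example.com/list?cat=1',
        'https://example.com/list?cat=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            # how many pages to crawl: pn=1 .. pn=5
            for i in range(1, 6):
                # No callback is given, so Scrapy falls back to self.parse.
                yield scrapy.Request(url=url + f'&pn={i}')

    def parse(self, response):
        self.logger.info('Parsing %s', response.url)

Saved as e.g. pagination_spider.py, it can be tried with "scrapy runspider pagination_spider.py". The bare start URLs themselves are never requested, only their &pn=1 through &pn=5 variants, so the duplicate page from the question disappears.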


Source: https://stackoverflow.com/questions/65423455/how-to-yield-in-scrapy-without-a-request
