Scrapy: Get Start_Urls from Database by Pipeline


Question


Unfortunately I don't have enough reputation to make a comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider

I have many URLs in a DB, and I want to get the start_urls for my spider from there. So far that's not a big problem. However, I don't want the MySQL code inside the spider, and when I move it into a pipeline I run into a problem. If I try to hand the pipeline object over to my spider as in the referenced question, I only get an AttributeError with the message

'NoneType' object has no attribute 'getUrl'

I think the actual problem is that the function spider_opened never gets called (I also inserted a print statement which never showed its output in the console). Does anybody have an idea how to get the pipeline object inside the spider?

MySpider.py

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self):
        self.pipe = None  # should be set by the pipeline's spider_opened handler

    def start_requests(self):
        url = self.pipe.getUrl()  # raises AttributeError: self.pipe is still None
        yield scrapy.Request(url, callback=self.parse)

Pipeline.py

from scrapy import signals

class MyPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        # note: the pipeline instance is never returned from this method

    def spider_opened(self, spider):
        spider.pipe = self

    def getUrl(self):
        ...

Answer 1:


Scrapy pipelines already have the expected methods open_spider and close_spider.

Taken from docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider

open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened

close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
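
For illustration, here is a minimal pipeline sketch (the class and attribute names are my own, not from the original post) showing where these hooks fit:

class DbPipeline:
    def open_spider(self, spider):
        # runs once when the spider starts; acquire resources here
        self.db = ...  # e.g. open your database connection

    def close_spider(self, spider):
        # runs once when the spider finishes; release resources here
        self.db = None

    def process_item(self, item, spider):
        # runs for every scraped item
        return item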

However, your original issue doesn't make much sense: why do you want to assign a pipeline reference to your spider? That seems like a very bad idea.

What you should do instead is open the database and read the URLs in the spider itself:

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = spider.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = ...  # get a DB cursor here
        urls = ...  # use the cursor to fetch your URLs
        return urls
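
Since the URLs live in MySQL here, get_urls_from_db could be filled in with any DB-API driver. A minimal sketch assuming pymysql and a hypothetical start_urls table with a url column (connection settings are illustrative):

import pymysql  # assumed driver; any DB-API module works the same way

def get_urls_from_db(self):
    # hypothetical credentials, database, and table/column names
    connection = pymysql.connect(host='localhost', user='scrapy',
                                 password='secret', database='crawler')
    try:
        with connection.cursor() as cursor:
            cursor.execute('SELECT url FROM start_urls')
            return [row[0] for row in cursor.fetchall()]
    finally:
        connection.close()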


Source: https://stackoverflow.com/questions/46339263/scrapy-get-start-urls-from-database-by-pipeline
