setting start urls for scrapy outside of class

Posted by 那年仲夏 on 2019-12-24 06:07:38

Question


I am new to Scrapy. How can I pass start_urls from outside the class? I tried defining start_urls outside the class, but it didn't work. What I am trying to do is create one file per entry in a dictionary (search_dict), using the key as the file name and the value as the start URL for Scrapy.

import csv

import scrapy
from scrapy.crawler import CrawlerProcess

search_dict = {'hello world': 'https://www.google.com/search?q=hello+world',
               'my code': 'https://www.google.com/search?q=stackoverflow+questions',
               'test': 'https://www.google.com/search?q="test"'}

class googlescraper(scrapy.Spider):
    name = "test"
    allowed_domains = ["google.com"]
    #start_urls= ??
    found_items = []
    def parse(self, response):
        item=dict()
        #code here
        self.found_items.append(item)

for k,v in search_dict.items():
    with open(k,'w') as csvfile:
        process = CrawlerProcess({
            'DOWNLOAD_DELAY': 0,
            'LOG_LEVEL': 'DEBUG',
            'DOWNLOAD_TIMEOUT':30,})
        process.crawl(googlescraper) #scrapy spider needs start url
        spider = next(iter(process.crawlers)).spider
        process.start()
        dict_writer = csv.DictWriter(csvfile, keys)
        dict_writer.writeheader()
        dict_writer.writerows(spider.found_items)

Answer 1:


The Scrapy documentation has an example of instantiating a crawler with arguments: https://docs.scrapy.org/en/latest/topics/spiders.html#spider-arguments

You could pass in your URLs with something like:

# ...

class GoogleScraper(scrapy.Spider):
    # ...
    # Omit `start_urls` in the class definition
    # ...

process.crawl(GoogleScraper, start_urls=[
    # The URL you want to pass here
])

The keyword arguments in the call to process.crawl() are passed to the spider's initializer. The default initializer copies any keyword arguments onto the spider instance as attributes, so this has the same effect as setting start_urls in the class definition.
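
To make the keyword-argument behaviour described above concrete, here is a minimal sketch that does not use Scrapy at all. MiniSpider is a hypothetical stand-in written for illustration only; it imitates the "copy kwargs onto the instance" step and is not Scrapy's actual class. The trimmed search_dict reuses two entries from the question.

```python
class MiniSpider:
    """Hypothetical stand-in (not Scrapy's class) that imitates how a
    spider's default initializer treats keyword arguments."""

    name = "mini"

    def __init__(self, **kwargs):
        # Copy every keyword argument onto this instance as an attribute.
        self.__dict__.update(kwargs)
        # Fall back to an empty list when no start_urls were supplied.
        if not hasattr(self, "start_urls"):
            self.start_urls = []


search_dict = {
    "hello world": "https://www.google.com/search?q=hello+world",
    "test": 'https://www.google.com/search?q="test"',
}

# One instance per search term, each carrying its own start_urls.
spiders = {
    term: MiniSpider(start_urls=[url]) for term, url in search_dict.items()
}

for term, spider in spiders.items():
    print(term, spider.start_urls)
```

Because the attribute is set on each instance, the class itself never needs a start_urls definition, which is why omitting it from the class body works when the value arrives through process.crawl().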

Relevant section in Scrapy docs: https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess.crawl



Source: https://stackoverflow.com/questions/56435618/setting-start-urls-for-scrapy-outside-of-class
