Question
I am new to Scrapy. How can I pass start_urls to the spider from outside of the class? I tried defining start_urls outside of the class, but it didn't work. What I am trying to do is create a file named after each key of a dictionary (search_dict) and use the corresponding value as the start URL for Scrapy:
import csv
import scrapy
from scrapy.crawler import CrawlerProcess

search_dict = {
    'hello world': 'https://www.google.com/search?q=hello+world',
    'my code': 'https://www.google.com/search?q=stackoverflow+questions',
    'test': 'https://www.google.com/search?q="test"',
}
class googlescraper(scrapy.Spider):
    name = "test"
    allowed_domains = ["google.com"]
    # start_urls = ??
    found_items = []

    def parse(self, response):
        item = dict()
        # code here
        self.found_items.append(item)
for k, v in search_dict.items():
    with open(k, 'w') as csvfile:
        process = CrawlerProcess({
            'DOWNLOAD_DELAY': 0,
            'LOG_LEVEL': 'DEBUG',
            'DOWNLOAD_TIMEOUT': 30,
        })
        process.crawl(googlescraper)  # scrapy spider needs start url
        spider = next(iter(process.crawlers)).spider
        process.start()
        dict_writer = csv.DictWriter(csvfile, keys)
        dict_writer.writeheader()
        dict_writer.writerows(spider.found_items)
Answer 1:
The Scrapy documentation has an example of instantiating a crawler with arguments: https://docs.scrapy.org/en/latest/topics/spiders.html#spider-arguments
You could pass in your URLs with something like:
# ...

class GoogleScraper(scrapy.Spider):
    # ...
    # Omit `start_urls` in the class definition
    # ...

process.crawl(GoogleScraper, start_urls=[
    # The URL you want to pass here
])
The kwargs in the call to process.crawl() are passed to the spider's initializer. The default initializer copies any kwargs onto the spider instance as attributes, so this is equivalent to setting start_urls in the class definition.
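Under the default initializer, the call above behaves roughly like the following (a simplified illustration of the mechanism, not Scrapy's actual code path; Spider.__init__ copies keyword arguments onto the instance):

spider = GoogleScraper()
spider.start_urls = ['https://www.google.com/search?q=hello+world']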
Relevant section in Scrapy docs: https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess.crawl
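Putting this together with the loop from the question, a minimal sketch could look like the following. It assumes the extraction logic from the question fills the item dicts; the per-term .csv filenames and the scheduled list are illustrative choices, not Scrapy requirements. Note that process.start() may only be called once per process (the underlying Twisted reactor cannot be restarted), so all crawls are scheduled first and the CSV files are written after the run finishes, rather than inside the loop as in the question's code.

import csv
import scrapy
from scrapy.crawler import CrawlerProcess

search_dict = {
    'hello world': 'https://www.google.com/search?q=hello+world',
    'my code': 'https://www.google.com/search?q=stackoverflow+questions',
    'test': 'https://www.google.com/search?q="test"',
}

class GoogleScraper(scrapy.Spider):
    name = 'test'
    allowed_domains = ['google.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Per-instance result list so concurrent crawls do not share state.
        self.found_items = []

    def parse(self, response):
        item = {}  # extraction logic from the question goes here
        self.found_items.append(item)

process = CrawlerProcess({
    'DOWNLOAD_DELAY': 0,
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOAD_TIMEOUT': 30,
})

# Schedule one crawl per search term. crawl() only schedules the run, and
# its kwargs (here start_urls) are forwarded to the spider initializer.
scheduled = []
for term, url in search_dict.items():
    crawler = process.create_crawler(GoogleScraper)
    process.crawl(crawler, start_urls=[url])
    scheduled.append((term, crawler))

# Blocks until every scheduled crawl has finished.
process.start()

# Write one CSV per search term once all crawls are done.
for term, crawler in scheduled:
    items = crawler.spider.found_items
    if not items:
        continue
    with open(term + '.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=items[0].keys())
        writer.writeheader()
        writer.writerows(items)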
Source: https://stackoverflow.com/questions/56435618/setting-start-urls-for-scrapy-outside-of-class