scrapy run spider from script

野的像风 2020-12-09 03:53

I want to run my spider from a script rather than via `scrapy crawl`.

I found this page

http://doc.scrapy.org/en/latest/topics/practices.html

4 Answers
  • 2020-12-09 04:20

You can just create a normal Python script and then use Scrapy's `runspider` command-line option, which lets you run a spider without creating a project.

    For example, you can create a single file stackoverflow_spider.py with something like this:

    import scrapy
    from scrapy.loader import ItemLoader

    class QuestionItem(scrapy.Item):
        idx = scrapy.Field()
        title = scrapy.Field()

    class StackoverflowSpider(scrapy.Spider):
        name = 'SO'
        start_urls = ['http://stackoverflow.com']

        def parse(self, response):
            # response.css() returns a SelectorList directly;
            # there is no need to construct a Selector by hand
            questions = response.css('#question-mini-list .question-summary')
            for i, elem in enumerate(questions):
                loader = ItemLoader(item=QuestionItem(), selector=elem)
                loader.add_value('idx', i)
                loader.add_xpath('title', './/h3/a/text()')
                yield loader.load_item()
    

    Then, provided you have Scrapy properly installed, you can run it with:

    scrapy runspider stackoverflow_spider.py -o questions-items.json
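
    The `-o` option exports the scraped items as a JSON array. A minimal sketch of reading that feed back in; the sample data below is made up for illustration, and note that a default `ItemLoader` (with no output processors) stores each field as a list:

    ```python
    import json

    # Illustrative sample of the JSON feed structure the runspider command produces.
    # Each exported field is a list because the default ItemLoader collects values.
    sample = '[{"idx": [0], "title": ["How to run a spider from a script?"]}]'

    items = json.loads(sample)
    for item in items:
        print(item["idx"][0], item["title"][0])
    ```

    Adding a `TakeFirst` output processor to the loader would export plain scalars instead of single-element lists.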
    
  • 2020-12-09 04:30

    Luckily, Scrapy's source is open, so you can look at how the `crawl` command works and do the same in your code (note that this relies on Scrapy internals, which change between versions):

    ...
    crawler = self.crawler_process.create_crawler()
    spider = crawler.spiders.create(spname, **opts.spargs)
    crawler.crawl(spider)
    self.crawler_process.start()
    
  • 2020-12-09 04:38

    Why don't you just do this?

    from scrapy import cmdline
    
    cmdline.execute("scrapy crawl myspider".split())
    

    Put that script in the same directory as your scrapy.cfg file (the project root).
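
    `cmdline.execute` takes an argv-style list, so `str.split()` is just a shortcut for building one. If you need extra options, a literal list is safer because it survives arguments that contain spaces. A sketch, assuming a spider named `myspider` and a hypothetical output file `items.json`:

    ```python
    # Build the argv list explicitly; this is exactly what
    # cmdline.execute("scrapy crawl myspider -o items.json".split()) would receive.
    argv = ["scrapy", "crawl", "myspider", "-o", "items.json"]

    # The split() shortcut yields the same list as long as no argument contains spaces.
    assert "scrapy crawl myspider -o items.json".split() == argv
    ```

    You would then call `cmdline.execute(argv)` just as in the snippet above.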

  • 2020-12-09 04:41

    It is simple and straightforward :)

    Just check the official documentation. I would make one small change there, so that the spider runs only when you execute python myscript.py directly, and not every time you import from it: add an if __name__ == "__main__": guard.

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider(scrapy.Spider):
        # Your spider definition
        pass
    
    if __name__ == "__main__":
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        })
    
        process.crawl(MySpider)
        process.start() # the script will block here until the crawling is finished
    

    Now save the file as myscript.py and run `python myscript.py`.

    Enjoy!
