How to handle connection or download errors in Scrapy?


OK, I have been trying to play nice with Scrapy and exit gracefully when there is no internet connection or some other error occurs. The result? I could not get it to work properly. Instead I ended up just shutting down the entire interpreter, and all of its obnoxious deferred children, with os._exit(0), like this:

import os
import socket
#from scrapy.exceptions import CloseSpider
...
def check_connection(self):
    try:
        # Open (and immediately close) a TCP connection to verify connectivity
        socket.create_connection(("www.google.com", 443), timeout=5).close()
        return True
    except OSError:
        return False

def start_requests(self):
    if not self.check_connection():
        print('Connection Lost! Please check your internet connection!', flush=True)
        os._exit(0)                     # Kill everything: interpreter, reactor, deferreds
        #CloseSpider('Grace Me!')       # Closes cleanly, but expect deferred errors!
        #raise CloseSpider('No Grace')  # Raises an exception (with traceback)?!
    ...

That did it!


NOTE

I tried to use various internal methods to shut down Scrapy and to handle the obnoxious:

[scrapy.core.scraper] ERROR: Error downloading

issue. This (apparently) only happens when you use raise CloseSpider('Because of Connection issues!'), among many other attempts. It is then followed by a twisted.internet.error.DNSLookupError, which seems to appear out of nowhere even though I had handled it in my code. Obviously, raise always raises an exception, so one option is to use CloseSpider() without it.
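For the download errors themselves, a per-request alternative (a minimal sketch built on Scrapy's documented Request errback hook; the spider name and URL below are placeholders) is to attach an errback and inspect the Twisted failure there:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError


class ErrbackDemoSpider(scrapy.Spider):
    name = 'errback_demo'

    def start_requests(self):
        # Attach an errback so download-level failures (DNS errors, timeouts, ...)
        # are routed to our own handler instead of just being logged as errors
        yield scrapy.Request('https://www.example.com', callback=self.parse,
                             errback=self.on_error)

    def parse(self, response):
        self.logger.info('Got %s', response.url)

    def on_error(self, failure):
        # failure is a twisted.python.failure.Failure; check() identifies the error type
        if failure.check(DNSLookupError):
            self.logger.error('DNS lookup failed for %s', failure.request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            self.logger.error('Request timed out: %s', failure.request.url)
        elif failure.check(HttpError):
            self.logger.error('Non-2xx response: %s', failure.value.response.status)
        else:
            self.logger.error('Download error: %r', failure)

Handled this way, a DNS failure on one URL does not have to take the whole crawl down with it.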


The issue at hand also seems to be a recurring one in the Scrapy framework... and in fact the source code has some FIXMEs in there. Even when I tried to apply things like:

from twisted.internet import defer
from scrapy import signals
from scrapy.utils.signal import disconnect_all

def stop(self):
    self.deferred = defer.Deferred()
    # Disconnect every public signal defined in scrapy.signals
    for name, signal in vars(signals).items():
        if not name.startswith('_'):
            disconnect_all(signal)
    self.deferred.callback(None)

and using these...

#self.stop()
#sys.exit()
#disconnect_all(signal, **kwargs)
#self.crawler.engine.close_spider(spider, 'cancelled')
#scrapy.crawler.CrawlerRunner.stop()
#crawler.signals.stop()
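For the record, the shutdown route Scrapy's own docs describe is raising CloseSpider from a spider callback rather than from start_requests. A minimal sketch of that, with an illustrative spider name, URL and closing condition:

import scrapy
from scrapy.exceptions import CloseSpider


class GracefulSpider(scrapy.Spider):
    name = 'graceful'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        if not response.body:
            # Raised from a callback, CloseSpider is handled by the engine,
            # which stops scheduling and closes the spider with this reason.
            raise CloseSpider('empty response, giving up')
        self.logger.info('Parsed %s', response.url)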

PS. It would be great if the Scrapy devs could document how best to deal with such a simple case as no internet connection.

I believe I may just have found an answer. To exit start_requests gracefully, return []. This tells Scrapy there are no requests to process.

To close a spider, call the close() method on the spider: self.close('reason')

import logging
import socket

import scrapy


class SpiderIndex(scrapy.Spider):
    name = 'test'

    def check_connection(self):
        try:
            # Open (and immediately close) a TCP connection to verify connectivity
            socket.create_connection(("www.google.com", 443), timeout=5).close()
            return True
        except OSError:
            return False

    def start_requests(self):
        if not self.check_connection():
            print('Connection Lost! Please check your internet connection!', flush=True)
            self.close(self, 'Connection Lost!')  # see the addendum below on the signature
            return []

        # Continue as normal ...
        request = scrapy.Request(url='https://www.google.com', callback=self.parse)
        yield request

    def parse(self, response):
        self.log(f'===TEST SPIDER: PARSE REQUEST======{response.url}===========', logging.INFO)

Addendum: For some strange reason, on one spider self.close('reason') worked, whereas on another I had to change it to self.close(self, 'reason').
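A possible explanation, based on how Spider.close appears to be defined in recent Scrapy releases (check scrapy/spiders/__init__.py in your installed version to confirm): close is a static method that takes the spider and the reason explicitly, so self.close(self, 'reason') matches its (spider, reason) signature, whereas self.close('reason') only fits a version where close() is an instance method. Roughly:

# Approximately how Spider.close is defined in recent Scrapy versions
# (paraphrased, not verbatim -- verify against your installed source):
class Spider:
    @staticmethod
    def close(spider, reason):
        # Delegates to the spider's optional closed() hook, if it defines one
        closed = getattr(spider, 'closed', None)
        if callable(closed):
            return closed(reason)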
