Scrapy middleware order

回眸只為那壹抹淺笑 submitted on 2020-01-01 05:20:09

Question


The Scrapy documentation says:

the first middleware is the one closer to the engine and the last is the one closer to the downloader.

To decide which order to assign to your middleware see the DOWNLOADER_MIDDLEWARES_BASE setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied.

I'm not entirely clear from this whether a higher value would result in a middleware getting executed first or vice versa.

E.g.

'myproject.middlewares.MW1': 543,
'myproject.middlewares.MW2': 542,

Questions:

  1. Which of these will be executed first? My trial says that MW2 would be first.
  2. What's the valid range for the orders? 0 - 999?

Answer 1:


  1. Which of these will be executed first? My trial says that MW2 would be first.

As you quoted the docs:

the first middleware is the one closer to the engine and the last is the one closer to the downloader.

So the downloader middleware with a value of 542 is executed before the middleware with a value of 543. That means myproject.middlewares.MW2.process_request(request, spider) is called first, and after it has altered the request (if needed), the request is passed to the next downloader middleware.
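
A minimal sketch of that ordering, using plain Python rather than Scrapy itself: sorting the two entries from the question by their order value puts MW2 ahead of MW1. The variable names here are illustrative only.

```python
# Order values from the question; lower numbers sit closer to the engine.
middlewares = {
    "myproject.middlewares.MW1": 543,
    "myproject.middlewares.MW2": 542,
}

# Sort by the integer order, ascending, to get the process_request call order.
request_order = [
    path for path, order in sorted(middlewares.items(), key=lambda item: item[1])
]

print(request_order)
```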

  2. What's the valid range for the orders? 0 - 999?

The value is an integer.

UPDATE:

Look at the architecture overview in the Scrapy documentation.

Also, the full quote:

The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined in Scrapy (and not meant to be overridden) and then sorted by order to get the final sorted list of enabled middlewares: the first middleware is the one closer to the engine and the last is the one closer to the downloader.

So, as the values are integers, they have the range of Python integers.
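
The merge-and-sort step from the quote can be sketched in plain Python. This is an illustration under assumptions, not Scrapy's implementation: `BASE`, `merge_and_sort`, and the `builtin.*` names with their values are hypothetical stand-ins for DOWNLOADER_MIDDLEWARES_BASE. It also shows that negative or very large integers sort without trouble, and that an entry set to None (the documented way to disable a middleware) drops out of the list.

```python
# Hypothetical stand-in for DOWNLOADER_MIDDLEWARES_BASE (names and values are made up).
BASE = {
    "builtin.RobotsLike": 100,
    "builtin.RetryLike": 550,
    "builtin.CacheLike": 900,
}


def merge_and_sort(base, custom):
    """Merge two order dicts and return the paths sorted by order, ascending."""
    merged = {**base, **custom}  # project settings win on conflicts
    # A value of None disables a middleware, so drop those entries.
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


custom = {
    "myproject.middlewares.MW1": 543,
    "myproject.middlewares.MW2": 542,
    "myproject.middlewares.Outermost": -5,     # negative integers sort first
    "myproject.middlewares.Innermost": 10**6,  # large integers sort last
    "builtin.RetryLike": None,                 # disabled
}

ordered = merge_and_sort(BASE, custom)
for path in ordered:
    print(path)
```

The first entry printed is the one closest to the engine and the last is the one closest to the downloader.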




Answer 2:


I know this has been answered, but really it's a more complicated thing -- requests and responses are handled in opposite order.

You can think of it like this:

  • 0 - engine makes request
  • 1..inf - process_request middleware calls
  • inf - actual download happens (if a request middleware didn't handle it)
  • inf..1 - process_response middleware calls
  • 0 - response received by the engine

So if I tag my middleware as number 1, it will be the FIRST request middleware executed and the LAST response middleware executed. If I tag my middleware as 901, it will be the LAST request middleware executed and the FIRST response middleware executed (if only the default middlewares are defined).

Really the answer is that it IS confusing. The start of the request is nearest the engine (at zero) and the end of the request is nearest the downloader (high number). The start of the response is nearest the downloader (high number) and the end of the response is nearest the engine (at zero). It's like a trip out and back from the engine. Here's the relevant code from Scrapy that makes this all so fun (with __init__ copied from MiddlewareManager for reference and only the relevant method included):

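from collections import defaultdict

from scrapy.middleware import MiddlewareManager


class DownloaderMiddlewareManager(MiddlewareManager):
    def __init__(self, *middlewares):
        # Middlewares arrive already sorted by ascending order value.
        self.middlewares = middlewares
        self.methods = defaultdict(list)
        for mw in middlewares:
            self._add_middleware(mw)

    def _add_middleware(self, mw):
        # Request hooks are appended, so lower order values run first.
        if hasattr(mw, 'process_request'):
            self.methods['process_request'].append(mw.process_request)
        # Response and exception hooks are prepended, so higher order values run first.
        if hasattr(mw, 'process_response'):
            self.methods['process_response'].insert(0, mw.process_response)
        if hasattr(mw, 'process_exception'):
            self.methods['process_exception'].insert(0, mw.process_exception)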

As you can see, request methods are appended in sorted order (higher number added to the back), while response and exception methods are inserted at the beginning (higher number is first).
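
To see the effect of append versus insert(0, ...), here is a toy trace that mimics the registration logic above without importing Scrapy. `TraceMiddleware`, `build_chains`, and the `calls` list are hypothetical names made up for this sketch. Real middleware methods take request, response, and spider arguments and can short-circuit the chain, which this sketch ignores.

```python
from collections import defaultdict

calls = []  # records the order in which hooks fire


class TraceMiddleware:
    """Toy middleware that only logs which hook ran (hypothetical, not a Scrapy class)."""

    def __init__(self, name):
        self.name = name

    def process_request(self):
        calls.append(f"{self.name}.process_request")

    def process_response(self):
        calls.append(f"{self.name}.process_response")


def build_chains(middlewares):
    """Register hooks like the excerpt does: append requests, prepend responses."""
    methods = defaultdict(list)
    for mw in middlewares:  # expected to be sorted by ascending order value
        methods["process_request"].append(mw.process_request)
        methods["process_response"].insert(0, mw.process_response)
    return methods


# MW2 (542) sorts before MW1 (543).
chains = build_chains([TraceMiddleware("MW2"), TraceMiddleware("MW1")])

for hook in chains["process_request"]:
    hook()
calls.append("download")  # stands in for the actual download
for hook in chains["process_response"]:
    hook()

print(calls)
```

In this trace the middleware with the lower number sees the request first and the response last, which matches the out-and-back picture described above.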



Source: https://stackoverflow.com/questions/6623470/scrapy-middleware-order
