Python Scrapy - mimetype based filter to avoid non-text file downloads

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-22 06:47:24

Question


I have a running Scrapy project, but it is bandwidth-intensive because it tries to download a lot of binary files (zip, tar, mp3, etc.).

I think the best solution is to filter requests based on MIME type (the Content-Type HTTP header). I looked at the Scrapy code and found this setting:

DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'

I changed it to: DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.webclients.ScrapyHTTPClientFactory'

And I played a little with ScrapyHTTPPageGetter. Here are my edits:

class ScrapyHTTPPageGetter(HTTPClient):
    # ... original methods unchanged ...

    # this is my edit
    def handleEndHeaders(self):
        if 'Content-Type' in self.headers.keys():
            mimetype = str(self.headers['Content-Type'])
            # Actually I only need HTML, but just in
            # case I've preserved all text/* types
            if 'text/' in mimetype:
                # Good, this page is needed
                self.factory.gotHeaders(self.headers)
            else:
                self.factory.noPage(Exception('Incorrect Content-Type'))

I feel this is wrong. I need a more Scrapy-friendly way to cancel/drop the request right after determining that it has an unwanted MIME type, instead of waiting for the whole body to be downloaded.

Edit:
I'm asking specifically about this line: self.factory.noPage(Exception('Incorrect Content-Type')). Is that the correct way to cancel a request?

Update 1:
My current setup has crashed the Scrapy server, so please don't use the code above as-is to solve the problem.

Update 2:
I have set up an Apache-based website for testing, using the following structure:

/var/www/scrapper-test/Zend -> /var/www/scrapper-test/Zend.zip (symlink)
/var/www/scrapper-test/Zend.zip

I have noticed that Scrapy discards the URL with the .zip extension but scrapes the one without .zip, even though it is just a symbolic link to the same file.


Answer 1:


I built this middleware to exclude any response type that isn't in a whitelist of regular expressions:

import re

from scrapy.exceptions import IgnoreRequest


class FilterResponses:
    """Limit the HTTP response types that Scrapy downloads."""

    @staticmethod
    def is_valid_response(type_whitelist, content_type_header):
        return any(re.search(type_regex, content_type_header)
                   for type_regex in type_whitelist)

    def process_response(self, request, response, spider):
        """
        Only allow HTTP response types that match the given list of
        filtering regexes.
        """
        # Each spider must define the attribute response_type_whitelist as an
        # iterable of regular expressions, e.g. (r'text', )
        type_whitelist = getattr(spider, "response_type_whitelist", None)
        if not type_whitelist:
            return response

        # Header values are bytes on Python 3; decode before regex matching
        content_type_header = response.headers.get('content-type', b'').decode('latin-1')
        if not content_type_header:
            spider.logger.debug("no content type header: %s", response.url)
            raise IgnoreRequest()
        if self.is_valid_response(type_whitelist, content_type_header):
            spider.logger.debug("valid response %s", response.url)
            return response

        spider.logger.debug(
            "Ignoring request %s, content-type was not in whitelist", response.url)
        raise IgnoreRequest()

To use it, add it to settings.py:

DOWNLOADER_MIDDLEWARES = {
    '[project_name].middlewares.FilterResponses': 999,
}
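Each spider to be filtered then declares response_type_whitelist; the matching logic behaves like this stand-alone sketch (the whitelist value is an example):

```python
import re

# As declared on a spider: an iterable of regexes matched
# against the decoded Content-Type header value
response_type_whitelist = (r"text/", )

def is_valid_response(type_whitelist, content_type_header):
    # True if any whitelist regex matches somewhere in the header
    return any(re.search(t, content_type_header) for t in type_whitelist)

assert is_valid_response(response_type_whitelist, "text/html; charset=utf-8")
assert not is_valid_response(response_type_whitelist, "application/zip")
```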



Answer 2:


It may be too late, but you can use the Accept request header to tell servers what kind of data you are looking for.
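In Scrapy that would mean overriding the default Accept header, e.g. in settings.py. Note this is only a hint: servers are free to ignore Accept, so it does not replace response-side filtering. A sketch:

```python
# settings.py: ask servers for HTML only (example value)
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml;q=0.9",
}
```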




Answer 3:


The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.

What the proxy should do is:

  • Take HTTP requests from Scrapy, forward them to the server being crawled, and relay the responses back to Scrapy, i.e. intercept all HTTP traffic.
  • For binary files (detected by a heuristic you implement), send a 403 Forbidden response to Scrapy and immediately close the request/response. This saves time and bandwidth, and Scrapy won't crash.

Sample Proxy Code

It actually works!

var http = require('http');

http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Abort binary downloads early: drop the upstream
            // response and reject the client request
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.end('Binary download is disabled.');
            return;  // don't fall through and send headers twice
        }

        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with proxy request: ' + e.message);
    });

    // Forward any request body to the upstream server
    clientReq.pipe(proxyReq);

}).listen(8080);
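Scrapy's built-in HttpProxyMiddleware reads proxy settings from the standard environment variables, so the crawler can be routed through the Node.js proxy above by setting http_proxy before the crawl starts. The port matches the listen(8080) above; adjust as needed:

```python
import os

# Must be set before the Scrapy crawler process starts, because
# HttpProxyMiddleware reads the environment at initialization time.
os.environ["http_proxy"] = "http://localhost:8080"  # the Node.js proxy above
```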


Source: https://stackoverflow.com/questions/13401382/python-scrapy-mimetype-based-filter-to-avoid-non-text-file-downloads
