Python Scrapy - mimetype based filter to avoid non-text file downloads

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-22 06:47:24

Question


I have a running Scrapy project, but it is bandwidth-intensive because it tries to download a lot of binary files (zip, tar, mp3, etc.).

I think the best solution is to filter requests based on MIME type (the Content-Type HTTP header). I looked at the Scrapy code and found this setting:

DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'

I changed it to: DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.webclients.ScrapyHTTPClientFactory'

And I played a little with ScrapyHTTPPageGetter. Here are my edits:

class ScrapyHTTPPageGetter(HTTPClient):
    # ... original methods unchanged ...

    # this is my edit
    def handleEndHeaders(self):
        if 'Content-Type' in self.headers.keys():
            mimetype = str(self.headers['Content-Type'])
            # Actually I only need HTML, but just in
            # case I've preserved all text/* types
            if 'text/' in mimetype:
                # Good, this page is needed
                self.factory.gotHeaders(self.headers)
            else:
                self.factory.noPage(Exception('Incorrect Content-Type'))

I feel this is wrong. I need a more Scrapy-friendly way to cancel/drop the request right after determining that it has an unwanted MIME type, instead of waiting for the whole body to be downloaded.

Edit:
I'm asking specifically about this line: self.factory.noPage(Exception('Incorrect Content-Type')). Is that the correct way to cancel a request?

Update 1:
My current setup has crashed the Scrapy server, so please don't use the code above as-is to solve the problem.

Update 2:
I have set up an Apache-based website for testing, using the following structure:

/var/www/scrapper-test/Zend -> /var/www/scrapper-test/Zend.zip (symlink)
/var/www/scrapper-test/Zend.zip

I have noticed that Scrapy discards the URL with the .zip extension but scrapes the one without .zip, even though it is just a symbolic link to the same file.


Answer 1:


I built this middleware to exclude any response type that isn't in a whitelist of regular expressions:

import re

from scrapy.exceptions import IgnoreRequest


class FilterResponses:
    """Limit the HTTP response types that Scrapy downloads."""

    @staticmethod
    def is_valid_response(type_whitelist, content_type_header):
        return any(re.search(type_regex, content_type_header)
                   for type_regex in type_whitelist)

    def process_response(self, request, response, spider):
        """
        Only allow HTTP response types that match the given list of
        filtering regexes.
        """
        # Each spider must define the attribute response_type_whitelist as an
        # iterable of regular expressions, e.g. (r'text', )
        type_whitelist = getattr(spider, "response_type_whitelist", None)
        if not type_whitelist:
            return response

        # Header values are bytes on Python 3; decode before regex matching
        content_type_header = response.headers.get('content-type', b'').decode('latin-1')
        if not content_type_header:
            spider.logger.debug("no content type header: %s", response.url)
            raise IgnoreRequest()
        if self.is_valid_response(type_whitelist, content_type_header):
            spider.logger.debug("valid response %s", response.url)
            return response

        spider.logger.debug(
            "Ignoring request %s, content-type was not in whitelist", response.url)
        raise IgnoreRequest()

To use it, add it to settings.py:

DOWNLOADER_MIDDLEWARES = {
    '[project_name].middlewares.FilterResponses': 999,
}
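Each spider to be filtered then declares response_type_whitelist; the matching logic behaves like this stand-alone sketch (the whitelist value is an example):

```python
import re

# As declared on a spider: an iterable of regexes matched
# against the decoded Content-Type header value
response_type_whitelist = (r"text/", )

def is_valid_response(type_whitelist, content_type_header):
    # True if any whitelist regex matches somewhere in the header
    return any(re.search(t, content_type_header) for t in type_whitelist)

assert is_valid_response(response_type_whitelist, "text/html; charset=utf-8")
assert not is_valid_response(response_type_whitelist, "application/zip")
```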



Answer 2:


It may be too late, but you can use the Accept request header to tell servers what kind of data you are looking for.
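In Scrapy that would mean overriding the default Accept header, e.g. in settings.py. Note this is only a hint: servers are free to ignore Accept, so it does not replace response-side filtering. A sketch:

```python
# settings.py: ask servers for HTML only (example value)
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml;q=0.9",
}
```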




Answer 3:


The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.

What the proxy should do is:

  • Take HTTP requests from Scrapy, forward them to the server being crawled, and relay the responses back to Scrapy, i.e. intercept all HTTP traffic.
  • For binary files (detected by a heuristic you implement), send a 403 Forbidden response to Scrapy and immediately close the request/response. This saves time and bandwidth, and Scrapy won't crash.

Sample Proxy Code

It actually works!

var http = require('http');

http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Abort binary downloads early: drop the upstream
            // response and reject the client request
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.end('Binary download is disabled.');
            return;  // don't fall through and send headers twice
        }

        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with proxy request: ' + e.message);
    });

    // Forward any request body to the upstream server
    clientReq.pipe(proxyReq);

}).listen(8080);
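Scrapy's built-in HttpProxyMiddleware reads proxy settings from the standard environment variables, so the crawler can be routed through the Node.js proxy above by setting http_proxy before the crawl starts. The port matches the listen(8080) above; adjust as needed:

```python
import os

# Must be set before the Scrapy crawler process starts, because
# HttpProxyMiddleware reads the environment at initialization time.
os.environ["http_proxy"] = "http://localhost:8080"  # the Node.js proxy above
```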


Source: https://stackoverflow.com/questions/13401382/python-scrapy-mimetype-based-filter-to-avoid-non-text-file-downloads
