Node.js proxy, dealing with gzip DEcompression

Question


I'm currently working on a proxy server where, in this case, we have to modify the data (using regular expressions) that we push through it.

In most cases it works fine, except for websites that use gzip as the content encoding (I think). I've come across a module called node-compress and tried to push the chunks that I receive through a decompress/gunzip stream, but it isn't really turning out as I expected (see below for code).

I figured I'd post some code to support my problem; this is the proxy that gets loaded with MVC (Express):

module.exports = {
    index: function(request, response){
        var iframe_url = "www.nu.nl"; // site with gzip encoding

        var http = require('http');
        var compress = require('compress'); // node-compress, used by doDecompress() below

        var httpClient = http.createClient(80, iframe_url);
        var headers = request.headers;
        headers.host = iframe_url;

        var remoteRequest = httpClient.request(request.method, request.url, headers);

        // Forward the incoming request body to the remote server.
        request.on('data', function(chunk) {
            remoteRequest.write(chunk);
        });

        request.on('end', function() {
            remoteRequest.end();
        });

        remoteRequest.on('response', function (remoteResponse) {
            var body_regexp = new RegExp("<head>"); // regex to find the first head tag
            var href_regexp = new RegExp('<a href="(.*)"', 'g'); // regex to find hrefs

            response.writeHead(remoteResponse.statusCode, remoteResponse.headers);

            remoteResponse.on('data', function (chunk) {
                // Try to decompress each chunk before rewriting it (doDecompress() is shown below).
                var body = doDecompress(new compress.GunzipStream(), chunk);
                body = body.replace(body_regexp, "<head><base href=\"http://" + iframe_url + "/\">");
                body = body.replace(href_regexp, '<a href="#" onclick="javascript:return false;"');

                response.write(body, 'binary');
            });

            remoteResponse.on('end', function() {
                response.end();
            });
        });
    }
};

At the var body part I want to read the body and, in this case for example, remove all hrefs by replacing them with a #. The problem, of course, is that when we have a site which is gzip encoded/compressed, the body is all gibberish and we can't apply the regexps.
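As an aside, whether the upstream response is really gzip-compressed can be confirmed from its headers before trying to decompress anything. A minimal sketch of that check, reusing the remoteResponse variable from the snippet above (the isGzipped name is just for illustration):

// Decompress only when the upstream server actually used gzip;
// other responses can be rewritten and forwarded as-is.
var encoding = (remoteResponse.headers['content-encoding'] || '').toLowerCase();
var isGzipped = (encoding === 'gzip');

if (isGzipped) {
    // If the proxy forwards the body decompressed, these headers no longer
    // describe what the client receives, so drop them before response.writeHead().
    delete remoteResponse.headers['content-encoding'];
    delete remoteResponse.headers['content-length'];
}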

Now, I've already tried to mess around with the node-compress module:

 doDecompress(new compress.GunzipStream(), chunk);

which refers to

var sys = require('sys'); // legacy logging helper used by the node-compress demo

function doDecompress(decompressor, input) {
  // Split the input in two, as the node-compress demo does, to show multiple writes.
  var d1 = input.substr(0, 25);
  var d2 = input.substr(25);

  sys.puts('Making decompression requests...');
  var output = '';
  decompressor.setInputEncoding('binary');
  decompressor.setEncoding('utf8');
  decompressor.addListener('data', function(data) {
    output += data;
  }).addListener('error', function(err) {
    throw err;
  }).addListener('end', function() {
    sys.puts('Decompressed length: ' + output.length);
    sys.puts('Raw data: ' + output);
  });
  decompressor.write(d1);
  decompressor.write(d2);
  decompressor.close();
  sys.puts('Requests done.');
}

But it fails, since the chunk input is an object, so I tried supplying it as chunk.toString(), which also fails with invalid input data.

I was wondering whether I'm heading in the right direction at all?


Answer 1:


The decompressor expects binary-encoded input. The chunk that your response receives is an instance of Buffer, whose toString() method by default gives you back a UTF-8 encoded string.

So you have to use chunk.toString('binary') to make it work; this can also be seen in the demo.
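In the question's data handler that means passing chunk.toString('binary') into the decompressor. Keep in mind, though, that a gzip body generally cannot be decompressed by handing each arbitrary chunk to a fresh GunzipStream; all chunks have to flow through a single decompressor, or the whole body has to be buffered and decompressed once, before the regexes can be applied safely. Below is a minimal sketch of that buffer-then-decompress approach using Node's built-in zlib module instead of node-compress; it is an alternative under that assumption, not what this answer originally used:

var zlib = require('zlib');

remoteRequest.on('response', function (remoteResponse) {
    var chunks = [];

    // The body will be forwarded decompressed, so drop the headers that
    // describe the compressed representation before writing the head.
    delete remoteResponse.headers['content-encoding'];
    delete remoteResponse.headers['content-length'];
    response.writeHead(remoteResponse.statusCode, remoteResponse.headers);

    // Collect the compressed chunks; gzip data is only reliably
    // decompressed as one continuous stream.
    remoteResponse.on('data', function (chunk) {
        chunks.push(chunk);
    });

    remoteResponse.on('end', function () {
        zlib.gunzip(Buffer.concat(chunks), function (err, decompressed) {
            if (err) {
                response.end();
                return;
            }
            var body = decompressed.toString('utf8');
            body = body.replace(/<head>/, '<head><base href="http://' + iframe_url + '/">');
            body = body.replace(/<a href="(.*?)"/g, '<a href="#" onclick="javascript:return false;"');
            response.end(body);
        });
    });
});

Buffering the whole body trades memory for simplicity; a streaming zlib.createGunzip() pipe would avoid that, at the cost of having to run the regexes across re-assembled chunk boundaries.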



Source: https://stackoverflow.com/questions/4594654/node-js-proxy-dealing-with-gzip-decompression
