How to use XMLHttpRequest to download an HTML page in the background and extract a text element from it?

百般思念 submitted on 2019-11-27 09:38:17

For cross-origin requests, where the fetched site has not helpfully set a permissive CORS policy, Greasemonkey provides the GM_xmlhttpRequest() function. (Most other userscript engines also provide this function.)

GM_xmlhttpRequest is expressly designed to allow cross-origin requests.

To get your target information, create a DOMParser on the result. Do not use jQuery methods, as these cause extraneous images, scripts, and objects to load, slowing things down or crashing the page.

Here's a complete script that illustrates the process:

// ==UserScript==
// @name        _Parse Ajax Response for specific nodes
// @include     http://stackoverflow.com/questions/*
// @require     http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js
// @grant       GM_xmlhttpRequest
// ==/UserScript==

GM_xmlhttpRequest ( {
    method: "GET",
    url:    "http://www.rottentomatoes.com/m/godfather/",
    onload: function (response) {
        var parser  = new DOMParser ();
        /* IMPORTANT!
            1) For Chrome, see
            https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#DOMParser_HTML_extension_for_other_browsers
            for a work-around.

            2) jQuery.parseHTML() and similar are bad because they cause images, etc., to be loaded.
        */
        var doc         = parser.parseFromString (response.responseText, "text/html");
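        var criticNode  = doc.getElementsByClassName ("critic_consensus")[0];
        // Guard against a missing node, e.g. if the target page's markup has changed.
        var criticTxt   = criticNode ? criticNode.textContent : "(critic consensus not found)";

        // Insert as text, not markup, so the fetched content cannot inject HTML.
        $("<h1>").text (criticTxt).prependTo ("body");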
    },
    onerror: function (e) {
        console.error ('**** error ', e);
    },
    onabort: function (e) {
        console.error ('**** abort ', e);
    },
    ontimeout: function (e) {
        console.error ('**** timeout ', e);
    }
} );
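
The parsing step can also be pulled out of the onload handler so it is easier to reuse and check on its own. The following is only a sketch under stated assumptions: extractFirstText and its ParserCtor parameter are hypothetical names, not part of Greasemonkey or jQuery, and the parser is passed in so the helper can be exercised outside a browser.

```javascript
// Hypothetical helper (not part of any library): returns the trimmed text of
// the first element with the given class, or null when no such element exists.
function extractFirstText(html, className, ParserCtor) {
    // In a browser or userscript the built-in DOMParser is used by default.
    // The optional ParserCtor argument is an assumption made for testability.
    var Parser = ParserCtor || DOMParser;

    // Parsing into a detached document does not load images or run scripts.
    var doc   = new Parser().parseFromString(html, "text/html");
    var nodes = doc.getElementsByClassName(className);

    return nodes.length ? nodes[0].textContent.trim() : null;
}
```

Inside the onload handler above, this would be called as extractFirstText(response.responseText, "critic_consensus"), with a null result meaning the element was not found.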
Igor Barinov

The problem is: XMLHttpRequest cannot load http://www.rottentomatoes.com/m/godfather/. No 'Access-Control-Allow-Origin' header is present on the requested resource.

Because you are not the owner of the resource, you cannot set this header yourself.

What you can do is set up a proxy on Heroku that forwards all requests to the Rotten Tomatoes website. Here is a small Node.js proxy: https://gist.github.com/igorbarinov/a970cdaf5fc9451f8d34

var https = require('https'),
    http  = require('http'),
    util  = require('util'),
    path  = require('path'),
    fs    = require('fs'),
    colors = require('colors'),
    url = require('url'),
    httpProxy = require('http-proxy'),
    dotenv = require('dotenv');

dotenv.config();  // config() is the current name; load() was an older alias

var proxy = httpProxy.createProxyServer({});
var host = "www.rottentomatoes.com";
var port = Number(process.env.PORT || 5000);

process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";

var server = require('http').createServer(function(req, res) {
    // You can define here your custom logic to handle the request
    // and then proxy the request.
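    // Named requestPath so it does not shadow the 'path' module required above.
    var requestPath = url.parse(req.url, true).path;

    req.headers.host = host;
    // Allow any origin to read the proxied response.
    res.setHeader("Access-Control-Allow-Origin", "*");
    proxy.web(req, res, {
        target: "http://" + host + requestPath
    });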

}).listen(port);

proxy.on('proxyRes', function (res) {
    console.log('RAW Response from the target', JSON.stringify(res.headers, null, 2));
});


// util.puts() has been removed from Node.js; console.log() is the replacement.
console.log('Proxying to ' + host + '. Server'.blue + ' started '.green.bold + 'on port '.blue + port);

I modified the code from https://github.com/massive/firebase-proxy/ for this.
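
The essential step in the proxy above is adding the Access-Control-Allow-Origin header to the forwarded response. For comparison, here is a rough sketch of the same idea using only Node's built-in http module instead of http-proxy. The names withCorsHeaders and createCorsProxy are hypothetical, and this is an illustration under those assumptions, not the code running on the Heroku instance.

```javascript
var http = require("http");

// Hypothetical helper: copy the upstream headers and add a permissive CORS header.
function withCorsHeaders(upstreamHeaders) {
    var headers = Object.assign({}, upstreamHeaders);
    headers["access-control-allow-origin"] = "*";
    return headers;
}

// Hypothetical factory: returns an unstarted server that forwards every
// incoming request to targetHost over plain HTTP.
function createCorsProxy(targetHost) {
    return http.createServer(function (req, res) {
        // Reuse the client's headers, but present the target's host name.
        var headers = Object.assign({}, req.headers, { host: targetHost });

        var upstream = http.request(
            { host: targetHost, path: req.url, method: req.method, headers: headers },
            function (upstreamRes) {
                // Relay status and headers, with the CORS header added.
                res.writeHead(upstreamRes.statusCode, withCorsHeaders(upstreamRes.headers));
                upstreamRes.pipe(res);
            }
        );

        upstream.on("error", function (err) {
            res.writeHead(502, withCorsHeaders({ "content-type": "text/plain" }));
            res.end("Upstream error: " + err.message);
        });

        // Forward the request body, if any.
        req.pipe(upstream);
    });
}
```

It could be started with createCorsProxy("www.rottentomatoes.com").listen(Number(process.env.PORT || 5000)). Unlike http-proxy, this sketch does not handle HTTPS targets or WebSocket upgrades.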

I published the proxy at http://peaceful-cove-8072.herokuapp.com/ and you can test it at http://peaceful-cove-8072.herokuapp.com/m/godfather

Here is a fiddle to test it: http://jsfiddle.net/uuw8nryy/

var xhr = new XMLHttpRequest();
xhr.onload = function() {
  // The class name must be a quoted string; textContent yields the element's text.
  alert(this.responseXML.getElementsByClassName("critic_consensus")[0].textContent);
};
xhr.open("GET", "http://peaceful-cove-8072.herokuapp.com/m/godfather",true);
xhr.responseType = "document";
xhr.send();

The JavaScript same-origin policy prevents you from accessing content that belongs to a different origin (a different scheme, host, or port).

The above reference also gives you four techniques for relaxing this rule (CORS being one of them).
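
To make "same origin" concrete, here is a small sketch that compares two URLs the way the policy does, by scheme, host, and port. isSameOrigin is a hypothetical helper name; it relies on the standard URL class, which is available in modern browsers and in Node.js.

```javascript
// Hypothetical helper: two URLs share an origin only when their scheme,
// host, and port all match. URL.origin combines exactly those three parts
// and omits the port when it is the default for the scheme.
function isSameOrigin(urlA, urlB) {
    return new URL(urlA).origin === new URL(urlB).origin;
}
```

For example, isSameOrigin("http://stackoverflow.com/questions/1", "http://www.rottentomatoes.com/m/godfather/") is false, which is why the plain XMLHttpRequest in the question is blocked unless the target sends a CORS header.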
