Can I get the original page source (vs current DOM) with phantomjs/casperjs?

栀梦 2021-01-05 04:01

I am trying to get the original source for a particular web page.

The page executes some scripts that modify the DOM as soon as it loads. I would like to get the sou

3 answers
  •  余生分开走
    2021-01-05 04:36

    Hm, did you try using events? For example:

    // 'load.started' is emitted without arguments when loading begins
    casper.on('load.started', function() {
        casper.echo(casper.getPageContent());
    });
    

    I don't think it will work, but try it anyway.

    The problem is that you can't do it in a normal CasperJS step, because by then the scripts on your page have already executed. It could work if we could bind to the DOM-ready event, or if Casper had a specific event for it. But the page must be loaded before Casper can send any JS into the DOM environment, so binding to onready isn't possible (at least I don't see how). I think with PhantomJS we can only scrape data after the load event, that is, once the page is rendered.

    So if it can't be hacked with the events and maybe some delay, your only solution is to block the scripts that modify your DOM.

    There is still the PhantomJS setting, which you can use from Casper:

    casper.pageSettings.javascriptEnabled = false;
    

    The problem is that you need JS enabled to get the data back, so it can't work... :p Yeah, useless comment! :)

    Otherwise you have to use events to block the specific resource/script that modifies the DOM.

    Or you could use the resource.received event to scrape the data you want before the specific resources that modify the DOM arrive.

    In fact, I don't think that is possible: if you create a step that grabs data from the page just before specific resources arrive, then by the time your step executes, those resources will have loaded. You would need to freeze the following resources while your step is scraping the data.

    I don't know how to do that, but these events could help you:

    casper.on('resource.requested', function(request) {
        console.log(" request " + request.url);
    });
    
    casper.on('resource.received', function(resource) {
        console.log(resource.url);
    });
    
    // the argument here describes the failed resource (id, url, errorString)
    casper.on('resource.error', function (resourceError) {
        this.echo('[res : id and url + error description] <-- ' + resourceError.id + ' ' + resourceError.url + ' ' + resourceError.errorString);
    });
    

    See also How do you Disable css in CasperJS?. The solution that would work: identify the scripts and block them. But if you need them, well, I don't know; it's a good question. Maybe we could defer the execution of a specific script, but I don't think Casper and PhantomJS make that easy. The only useful option is abort(); if only we also had an option like timeout("time -> ms")!

    onResourceRequested
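
    A rough sketch of the "identify the scripts and block them" idea, assuming CasperJS 1.1+, where the resource.requested listener receives a second argument that exposes abort(). The names shouldBlock, installScriptBlocker and blockedPatterns are made up for illustration, and matching by URL substring is just one simple way to pick the scripts:

```javascript
// Hypothetical helper: true when the URL contains one of the given substrings.
function shouldBlock(url, blockedPatterns) {
    return blockedPatterns.some(function (pattern) {
        return url.indexOf(pattern) !== -1;
    });
}

// Hypothetical helper: registers a 'resource.requested' listener on a Casper
// instance. The listener is called with (requestData, networkRequest);
// networkRequest.abort() cancels the request before it is sent.
function installScriptBlocker(casper, blockedPatterns) {
    casper.on('resource.requested', function (requestData, networkRequest) {
        if (shouldBlock(requestData.url, blockedPatterns)) {
            casper.echo('blocked ' + requestData.url);
            networkRequest.abort();
        }
    });
}
```

    You would call installScriptBlocker(casper, [...]) before casper.start() and then read casper.getPageContent() in a step. This only stops external scripts from loading; inline scripts in the page still run, so whether it gets you close enough to the original source depends on the page.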

    Here is a similar question: Injecting script before other
