CasperJS: iterating over a list of links using casper.each

Submitted by 非 Y 不嫁゛ on 2019-12-01 03:55:26

I used Stack Overflow itself as a demo site to run your script against. I corrected a few minor things in your code, and the result is this exercise in collecting comments from PhantomJS bounty questions.

var casper = require('casper').create();

casper
.start()
.open('http://stackoverflow.com/questions/tagged/phantomjs?sort=featured&pageSize=30')
.then(function () {

    var date = Date.now(), object = {};
    object[date] = {};

    var listOfLinks = this.evaluate(function(){

        // Getting links to other pages to scrape, this will be 
        // a primitive array that will be easily returned from page.evaluate
        var links = [].map.call(document.querySelectorAll("#questions .question-hyperlink"), function(link) {
          return link.href;
        });    
        return links;
    });

    // Now to iterate over that array of links
    this.each(listOfLinks, function(self, eachPageHref) {

        object[date][eachPageHref] = []; // array for page to store names

        self.thenOpen(eachPageHref, function () {

            // Getting comments from each page, also as an array
            var listOfItems = this.evaluate(function() {
                var items = [].map.call(document.getElementsByClassName("comment-text"), function(comment) {
                    return comment.innerText;
                });    
                return items;
            });
            object[date][eachPageHref] = listOfItems;
        });
    });

    // After every link has been scraped, output the resulting object
    this.then(function(){
        console.log(JSON.stringify(object));
    });
})

casper.run();

What has changed: page.evaluate now returns simple arrays, which casper.each() needs in order to iterate correctly. The href attributes are extracted right away inside page.evaluate. There is also this correction:

 object[date][eachPageHref] = listOfItems; // previously assigned items which were undefined in this scope
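To illustrate why the href extraction uses [].map.call, here is a minimal sketch that runs in plain JavaScript without a browser. The fakeNodeList object and the example.com URLs are stand-ins I made up for illustration; a real NodeList from document.querySelectorAll is array-like in the same way (indexed entries plus a length), which is all that [].map.call relies on.

```javascript
// Hypothetical stand-in for a NodeList: indexed entries and a length,
// but no map() method of its own
var fakeNodeList = {
    0: { href: 'http://example.com/q/1' },
    1: { href: 'http://example.com/q/2' },
    length: 2
};

// Borrow Array.prototype.map to turn the array-like into a real array of strings
var links = [].map.call(fakeNodeList, function (link) {
    return link.href;
});

// A plain array of strings serializes cleanly, so it can be returned from evaluate()
var json = JSON.stringify(links);
console.log(json);
```

The point of pulling out link.href inside the page is that only the strings leave the page context, not the elements themselves.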

The result of running the CasperJS script is:

{"1478596579898":{"http://stackoverflow.com/questions/40410927/phantomjs-from-node-on-windows":["en.wikipedia.org/wiki/File_URI_scheme – Igor 2 days ago\n","@Igor is there something in particular you see wrong, or are you suggesting the phantom module has an incorrect URI? – Danny Buonocore 2 days ago\n","Probably windows security issue not allowing to run an unsigned program. – Vaviloff yesterday\n"],"http://stackoverflow.com/questions/40412726/casperjs-iterating-over-a-list-of-links-using-casper-each":["Thanks, this looked really promising. I made the changes but it didn't solve the problem. And I just realised that in debug mode the following happens: Creating new array object for https://example.com [debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true and then Casperjs silently fails. It seems that the correct link that gets passed into thenOpen gets changed to about:blank... – cyc665 yesterday\n"]}}

You are returning DOM nodes in the evaluate() function, which is not allowed. You can return the actual URLs instead.

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!

Reference: PhantomJS#evaluate
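The JSON rule of thumb can be tried out in plain JavaScript. This is only a sketch of the rule, not of how PhantomJS transfers values internally; roundTrip is a hypothetical helper and the example.com URLs are made up. DOM nodes cannot be shown outside a browser, so a function stands in for the kind of value that does not survive.

```javascript
// Hypothetical helper: send a value through JSON and back
function roundTrip(value) {
    var json = JSON.stringify(value);
    // JSON.stringify returns undefined for values it cannot represent at all
    return json === undefined ? undefined : JSON.parse(json);
}

// An array of URL strings survives intact
var urls = roundTrip(['http://example.com/a', 'http://example.com/b']);

// An object keeps its string property but silently loses the function
var record = roundTrip({ href: 'http://example.com/a', onClick: function () {} });

// A bare function does not survive at all
var fn = roundTrip(function () {});

console.log(urls);
console.log(record);
console.log(fn);
```

Here urls comes back as the same two strings, record comes back with only href, and fn is undefined, which is why returning the URLs rather than the elements is the safer choice.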

If I understand your problem correctly, the fix is to give items[] a wider (global) scope so that it is visible inside the loop. In your code, I would have done the following:

var items = [];
this.each(listOfLinks, function(self, link) {

    var eachPageHref = link.href;

    console.log("Creating new array in object for " + eachPageHref);

    object[date][eachPageHref] = []; // array for page to store names

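    self.thenOpen(eachPageHref, function () {

        // evaluate() runs inside the page and cannot see the outer items
        // array, so return the value and push it from the Casper side
        var item = this.evaluate(function() {
            // Perform DOM manipulation to get the item
            return whateverThisItemIs;
        });
        items.push(item);
    });
});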

Hope this helps.
