Web Crawler with JavaScript support in Perl?

余生分开走 2020-12-10 09:39

I want to write a Perl application that would crawl some websites and collect images and links from such w

5 Answers
  • 2020-12-10 09:53

    See the complete working example in "Scraping pages full of JavaScript". It uses Web::Scraper for HTML processing and Gtk3::WebKit to process dynamic content. However, the latter is quite a PITA to install. If you don't have that many pages to scrape (< 1000), fetching the post-processed DOM content through PhantomJS is an interesting option. I've written the following script for that purpose:

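    // Render a page in PhantomJS and save the resulting DOM to a file.
    var page = require('webpage').create(),
        system = require('system'),
        fs = require('fs'),
        address, output;

    if (system.args.length < 3 || system.args.length > 5) {
        console.log('Usage: phantomjs --load-images=no html.js URL filename');
        phantom.exit(1);
    } else {
        address = system.args[1];   // URL to fetch
        output = system.args[2];    // file that receives the rendered HTML
        page.open(address, function (status) {
            if (status !== 'success') {
                console.log('Unable to load the address!');
                phantom.exit(1);    // non-zero exit so the caller can detect the failure
            } else {
                // page.content is the serialized DOM as it stands once the page has loaded
                fs.write(output, page.content, 'w');
                phantom.exit();
            }
        });
    }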

    There's already something like that on CPAN: a module called Wight. I haven't tested it yet, though.
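
    Since the question is about collecting links and images, a variation on the PhantomJS script could return just those instead of the whole page. The following is only a sketch under a few assumptions: the file name links.js, the function name collectUrls and the JSON output shape are all made up for illustration and are not part of the original script. It relies on page.evaluate, which runs a function inside the loaded page and hands back JSON-serializable data.

```javascript
// Hypothetical helper (not from the original script). Under PhantomJS it is
// passed to page.evaluate, so it runs inside the page: it may only use the
// page's own `document` and must return JSON-serializable data.
function collectUrls() {
    function attr(selector, name) {
        var nodes = document.querySelectorAll(selector);
        // Reading the DOM properties `href` and `src` (rather than the raw
        // attributes) lets the browser resolve relative URLs for us.
        return Array.prototype.map.call(nodes, function (node) {
            return node[name];
        });
    }
    return { links: attr('a[href]', 'href'), images: attr('img[src]', 'src') };
}

// Only start crawling when run by PhantomJS; `phantom` is undefined elsewhere.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create(),
        system = require('system');

    if (system.args.length !== 2) {
        console.log('Usage: phantomjs --load-images=no links.js URL');
        phantom.exit(1);
    } else {
        page.open(system.args[1], function (status) {
            if (status !== 'success') {
                console.log('Unable to load the address!');
                phantom.exit(1);
            } else {
                // One JSON object on stdout: {"links": [...], "images": [...]}
                console.log(JSON.stringify(page.evaluate(collectUrls)));
                phantom.exit();
            }
        });
    }
}
```

    On the Perl side, one way to consume this would be to capture the script's standard output, check the exit status, and decode the JSON, for example with decode_json from the core JSON::PP module. Content that the page adds after its load event would still be missed, just as with the script that saves page.content.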

  • 2020-12-10 10:02

    WWW::Scripter with the WWW::Scripter::Plugin::JavaScript and WWW::Scripter::Plugin::Ajax plugins seems like the closest you'll get without using an actual browser (the modules WWW::Selenium, Mozilla::Mechanize or Win32::IE::Mechanize use real browsers).

  • 2020-12-10 10:10

    WWW::Mechanize::Firefox can be used together with the MozRepl add-on for Firefox. Because it drives a real browser, all of the page's JavaScript gets executed.

  • 2020-12-10 10:12

    Options that spring to mind:

    • You could have Perl use Selenium and have a full-blown browser do the work for you.

    • You can download and compile V8 or another open source JavaScript engine and have Perl call an external program to evaluate the JavaScript.

    • I don't think Perl's LWP module supports JavaScript, but you might want to check that if you haven't done so already.

  • There are several options.

    • Win32::IE::Mechanize on Windows
    • Mozilla::Mechanize
    • WWW::Mechanize::Firefox
    • WWW::Selenium
    • Wight