Question
I am interested in learning Perl. I am using the Learning Perl book and CPAN's websites for reference.
I would like to build a web/text scraping application in Perl to apply what I have learnt.
Please suggest some good options to begin with.
(This is not homework. I want to do something in Perl that would help me exercise basic Perl features.)
Answer 1:
If the web pages you want to scrape require JavaScript to function properly, you will need more than WWW::Mechanize can provide. You might even have to resort to controlling a specific browser from Perl (e.g. using Win32::IE::Mechanize or WWW::Mechanize::Firefox).
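For instance, a minimal WWW::Mechanize::Firefox session might look like the sketch below (the URL is a placeholder; the module assumes an already-running Firefox with the MozRepl extension installed, which it talks to):

```perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Drives a running Firefox instance (via the MozRepl extension), so any
# JavaScript on the page is executed by the real browser engine.
my $mech = WWW::Mechanize::Firefox->new;
$mech->get('http://example.com/js-heavy-page');    # placeholder URL

# ->content returns the DOM as currently rendered, after scripts have run.
print $mech->content;
```

The trade-off is that you need a full browser running, but you get the page exactly as a user would see it.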
I haven't tried it, but there is also WWW::Scripter with the WWW::Scripter::Plugin::JavaScript plugin.
Answer 2:
The most popular web scraping module for Perl is WWW::Mechanize, which is excellent if you can't just retrieve your destination page but need to navigate to it using links or forms, for instance, to log in. Have a look at its documentation for inspiration. If your needs are simple, you can extract the information you need from the HTML using regular expressions (but beware your sanity), otherwise it might be better to use a module such as HTML::TreeBuilder to do the job.
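As a rough sketch of the HTML::TreeBuilder approach (the markup and class name here are made up purely for illustration):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A stand-in for HTML you fetched with LWP or WWW::Mechanize.
my $html = '<ul><li class="item">foo</li><li class="item">bar</li></ul>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down() walks the parse tree and returns every element matching
# the given tag/attribute tests -- far more robust than a regex.
for my $li ( $tree->look_down( _tag => 'li', class => 'item' ) ) {
    print $li->as_text, "\n";
}

$tree->delete;    # free the tree's circular references
```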
A module that seems interesting, but that I haven't really tried yet, is WWW::Scripter. It's a subclass of WWW::Mechanize, but has support for JavaScript and AJAX, and also integrates HTML::DOM, another way to extract information from the page.
Answer 3:
As others have said, WWW::Mechanize is an excellent module to use for web scraping tasks; you'll do well to learn how to use it, as it can make common tasks very easy. I've used it for several web scraping tasks, and it just takes care of all the boring stuff - "go here, find a link with this text and follow it, now find a form with fields named 'username' and 'password', enter these values and submit the form...".
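That "follow a link, then log in" flow might be sketched like this (the URL, link text and field names are placeholders for whatever the target site actually uses):

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://example.com/');    # placeholder URL

# Find the link whose text is "Log in" and follow it.
$mech->follow_link( text => 'Log in' );

# Find the form containing these named fields, fill it in, and submit.
$mech->submit_form(
    with_fields => {
        username => 'me',
        password => 'secret',
    },
);

print $mech->content if $mech->success;
```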
Scrappy is also well worth a look - it lets you do a lot with very little code - an example from its documentation:
my $spidy = Scrappy->new;
$spidy->crawl('http://search.cpan.org/recent', {
    '#cpansearch li a' => sub {
        print shift->text, "\n";
    }
});
Scrappy makes use of Web::Scraper under the hood, which you might want to look at too as another option.
Also, if you need to extract data from HTML tables, HTML::TableExtract makes this dead easy - you can locate the table you're interested in by naming the headings it contains, and extract data very easily indeed, for example:
use HTML::TableExtract;

my $te = HTML::TableExtract->new( headers => [qw(Date Price Cost)] );
$te->parse($html_string) or die "Didn't find table";
foreach my $row ($te->rows) {
    print join(',', @$row), "\n";
}
Answer 4:
Try the Web::Scraper Perl module. A beginner's tutorial can be found here.
It's safe, easy to use and fast.
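As a taste of its declarative style, a minimal Web::Scraper sketch (the URL is a placeholder, and the selector assumes you want every link on the page):

```perl
use strict;
use warnings;
use URI;
use Web::Scraper;

# Declare *what* to extract; Web::Scraper handles fetching and parsing.
my $links = scraper {
    # Collect the href attribute of every anchor into a list.
    process 'a', 'urls[]' => '@href';
};

my $res = $links->scrape( URI->new('http://example.com/') );   # placeholder URL
print "$_\n" for @{ $res->{urls} };
```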
Answer 5:
You may also want to have a look at my new Perl wrapper around Java's HtmlUnit. It is very easy to use; for example, see the quick tutorial here:
http://code.google.com/p/spidey/wiki/QuickTutorial
By tomorrow I will publish detailed installation instructions and a first release. Unlike Mechanize and the like, you get some JavaScript support, and it is much faster and less memory-hungry than screen scraping.
Source: https://stackoverflow.com/questions/4861319/how-can-i-get-started-with-web-page-scraping-using-perl