Question
Suppose the content of an HTML page is:
<a href="abc.com"><b>ABC</b>industry</a>
<a href="google.com">ABC Search</a>
<a href="abc.com">Movies with<b>ABC</b></a>
I want to extract only the links that contain bold text. How can I do this using WWW::Mechanize?
Expected output:
ABC industry
Movies with ABC
I used
my @arr = $m->links();
foreach (@arr) { print $_->text; }
but this prints the text of every link on the page, not just the ones containing bold text.
Answer 1:
Without an extra module that can parse the contents of the page, it's going to be difficult to achieve your goal with WWW::Mechanize alone. However, there are other modules that make this very easy.
Here is an example using Mojo::DOM, which lets you select elements with CSS selectors. The Mojolicious distribution also contains Mojo::UserAgent, so you could migrate your code over to Mojo fairly easily if you are not too tied to WWW::Mechanize.
use Mojo::DOM;
use feature 'say';

# $html is the content of the page
my $dom = Mojo::DOM->new($html);
# find all <b> elements that are under <a> elements (at any depth beneath the <a>),
# then get the <a> ancestors of those elements;
# the result is a Mojo::Collection object
my $collection = $dom->find('a b')->map( sub { return $_->ancestors('a') } )->flatten;
$collection->each( sub {
    say "LINK: " . $_->all_text;
} );
# Alternatively, use a sub to perform an action on each of the retrieved <a> elements:
$dom->find('a b')->each( sub {
    $_->ancestors('a')->each( sub {
        say "All in one: " . $_->all_text
    } )
} );
Here's a demonstration with a sample list of links:
<html>
<ul><li><a href="abc.com"><b>ABC</b> industry</a></li>
<li><a href="google.com">ABC Search</a></li>
<li>Here is <a href="#">a link
<span>with a span
<b>and a "b" tag</b>
even though
</span> "b" tags are deprecated.</a> Yay!</li>
<li><a href="abc.com">Movies with <b>ABC</b></a></li></ul></html>
Output:
LINK: ABC industry
LINK: a link with a span and a "b" tag even though "b" tags are deprecated.
LINK: Movies with ABC
All in one: ABC industry
All in one: a link with a span and a "b" tag even though "b" tags are deprecated.
All in one: Movies with ABC
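If you would rather keep WWW::Mechanize for fetching, one option is to hand its page content to Mojo::DOM. The following is only a sketch under a few assumptions: `bold_links` is a hypothetical helper name, the sample HTML is the question's markup with spaces added, and the commented-out lines show where `$mech->content` would supply the HTML in a real script. It starts from the `<a>` elements and keeps those with a `<b>` descendant, so a link with several bold parts is reported once. Whitespace is normalized by hand because, depending on the Mojolicious version, `all_text` may return the text untrimmed.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper: return a list of { text, href } hashes,
# one for every <a> element that has a <b> descendant.
sub bold_links {
    my ($html) = @_;
    my $links = Mojo::DOM->new($html)
        ->find('a')                             # every link on the page
        ->grep( sub { defined $_->at('b') } )   # keep links with a <b> inside
        ->map( sub {
            my $text = $_->all_text;
            $text =~ s/\s+/ /g;                 # collapse runs of whitespace
            $text =~ s/^ | $//g;                # trim leading/trailing space
            return { text => $text, href => $_->attr('href') };
        } );
    return @{ $links->to_array };
}

# In a real script the HTML would come from WWW::Mechanize, for example:
#   my $mech = WWW::Mechanize->new;
#   $mech->get($url);
#   my $html = $mech->content;
my $sample_html = <<'HTML';
<a href="abc.com"><b>ABC</b> industry</a>
<a href="google.com">ABC Search</a>
<a href="abc.com">Movies with <b>ABC</b></a>
HTML

my @bold_links = bold_links($sample_html);
say "$_->{text} ($_->{href})" for @bold_links;
```

With the sample above this prints `ABC industry (abc.com)` and `Movies with ABC (abc.com)`; the `google.com` link is skipped because it contains no `<b>` element.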
If you use Mojo::UserAgent instead of WWW::Mechanize, your search can be even easier. Mojo::UserAgent can get a page (just like WWW::Mechanize), and the DOM of the returned page can be accessed using $ua->get($url)->res->dom. You can then chain your query above on this, to give the following:
use Mojo::UserAgent;
use feature 'say';

my $ua = Mojo::UserAgent->new();
# get the page and find the links with a <b> element in them:
$ua->get('http://my-url-here.com')
    ->res->dom('a b')->each( sub { $_->ancestors('a')->each( sub { say $_->all_text } ) } );
# example using this page:
# print the contents of divs with class 'spacer' that contain a link with a div in it:
$ua->get('http://stackoverflow.com/questions/26353298/find-links-containing-bold-text-using-wwwmechanize')
    ->res->dom('a div')->each( sub {
        $_->ancestors('div.spacer')->each( sub {
            say $_->all_text
        } )
    } );
Output:
1 How to use WWW::Mechanize to submit a form which isn't there in HTML?
0 How to process a simple loop in Perl's WWW::Mechanize?
0 Perl WWW::Mechanize cookie problem
1 Getting error in accessing a link using WWW::Mechanize
0 How to use output from WWW::Mechanize?
-2 Use WWW::Mechanize to login in webpage without form login but javascript using perl
3 Perl WWW::Mechanize Web Spider. How to find all links
0 Howto use WWW::Mechanize to access pages split by drop-down list
0 What is the best way to extract unique URLs and related link text via perl mechanize?
0 Perl WWW::Mechanize doesn't print results when reading input data from a data file
There are lots of examples in the Mojolicious documentation in case this isn't immediately comprehensible!
For a helpful 8 minute introductory video to Mojo::DOM and Mojo::UserAgent, check out Mojocast Episode 5.
Source: https://stackoverflow.com/questions/26353298/find-links-containing-bold-text-using-wwwmechanize