Getting the website title from a link in a string

Question

string: "Here is the badges, https://stackoverflow.com/badges bla bla bla"

If the string contains a link (see above), I want to get the website title of that link.

It should return: Badges - Stack Overflow.

How can I do that?

Thanks.

Answer 1:

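#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);   # give up after 10 seconds
$ua->env_proxy;     # pick up proxy settings from the environment

# Replace this with the URL taken from your string.
my $response = $ua->get('http://search.cpan.org/');

if ($response->is_success) {
    # title() is the document title that LWP parsed from the HTML head
    print $response->title(), "\n";
}
else {
    die $response->status_line;
}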

See LWP::UserAgent. Cheers :-)
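The snippet above fetches a fixed URL, so for the string in the question the link still has to be pulled out first. Here is a minimal sketch of that step. The helper names first_url and page_title are made up for illustration, and the regex is a deliberately naive assumption rather than a full URI grammar (a module such as URI::Find is the more robust choice). The fetch only runs when the script is called with --fetch, so the extraction part can be tried offline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first http(s) URL found in $text, or undef.
# The pattern is intentionally simple (an assumption, not a full URI grammar).
sub first_url {
    my ($text) = @_;
    my ($url) = $text =~ m{(https?://[^\s<>"']+)};
    return unless defined $url;
    $url =~ s/[.,;:!?)]+\z//;    # drop trailing sentence punctuation
    return $url;
}

# Hypothetical helper: fetch $url and return the page title, or undef.
sub page_title {
    my ($url) = @_;
    require LWP::UserAgent;      # loaded lazily so first_url() works without it
    my $ua = LWP::UserAgent->new(timeout => 10);
    $ua->env_proxy;
    my $response = $ua->get($url);
    return $response->is_success ? $response->title : undef;
}

my $string = 'Here is the badges, https://stackoverflow.com/badges bla bla bla';
my $url    = first_url($string);
print "URL: $url\n" if defined $url;

# Network access is only attempted when the script is run with --fetch.
if (defined $url && @ARGV && $ARGV[0] eq '--fetch') {
    my $title = page_title($url);
    print defined $title ? "Title: $title\n" : "Could not fetch a title\n";
}
```

Run it with --fetch to request the page as well. What comes back as the title depends on what the site serves at that moment.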

Answer 2:

I use URI::Find::Simple's list_uris function and URI::Title for this.
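For anyone who wants to see how those two modules fit together, here is a rough sketch. It assumes both are installed from CPAN. The helper name titles_in is made up for illustration. title() makes a network request for each URI and may not return a title, so the result is checked with defined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use URI::Find::Simple qw(list_uris);
use URI::Title qw(title);

# Hypothetical helper: return [url, title] pairs for every URI found in $text.
# title() performs a network request per URI; the title may be undef on failure.
sub titles_in {
    my ($text) = @_;
    return map { [ $_, scalar title($_) ] } list_uris($text);
}

# Usage: perl titles.pl "Here is the badges, https://stackoverflow.com/badges bla"
# With no arguments there is nothing to scan, so nothing is fetched or printed.
for my $pair (titles_in("@ARGV")) {
    my ($url, $title) = @$pair;
    printf "%s => %s\n", $url, defined $title ? $title : '(no title found)';
}
```

The text to scan is passed on the command line, so the script makes no requests unless it is given a string that contains a link.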

Answer 3:

Depending on how the link is given and how you define the title, you need one approach or another.

In the exact scenario that you have presented, extract the URL with URI::Find, HTML::LinkExtractor, etc.; then my $title = URI->new($link)->path() gives you the path of the link (here /badges), which you can treat as the title, alongside the link itself.

But if the website title is the linked text, as in <a href="https://stackoverflow.com/badges"> badged</a>, then the question "How can I extract URL and link text from HTML in Perl?" will give you the answer.
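For the linked-text case, one possible approach (not necessarily the one recommended in that other question) is HTML::TokeParser from the HTML-Parser distribution. This is a small sketch. The helper name links_with_text is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;

# Hypothetical helper: return [href, link text] pairs for each <a href> in $html.
sub links_with_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html);
    my @links;
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href;                    # skip anchors without href
        my $text = $parser->get_trimmed_text('/a');   # text up to the closing </a>
        push @links, [ $href, $text ];
    }
    return @links;
}

my $html = '<a href="https://stackoverflow.com/badges"> badged</a>';
for my $link (links_with_text($html)) {
    print "$link->[0] => $link->[1]\n";
}
```

get_trimmed_text strips the leading and trailing whitespace, so the text for the example link comes back as badged.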

If the title is encoded in the link itself, and the link is also its own link text, how do you define the title?

Do you want the last bit of the URI before any query? What happens with the queries set as URL paths?
Do you want the part between the host and the query?
Do you want to parse the link source and retrieve the title tag if any?
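The URL-only options above map fairly directly onto accessors of the URI module. Here is a small sketch to make the choices concrete. The helper name url_parts is made up for illustration, and the ?tab=bronze query string is an invented example that is not part of the question. Retrieving the title tag is a different matter, since it needs an actual HTTP request.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use URI;

# Hypothetical helper: split $link into the pieces the questions above refer to.
sub url_parts {
    my ($link) = @_;
    my $uri      = URI->new($link);
    my @segments = grep { length } $uri->path_segments;   # drop empty segments
    my $host     = $uri->host;
    my $path     = $uri->path;       # the part between the host and the query
    my $query    = $uri->query;      # undef when the URL has no query
    return {
        host         => $host,
        path         => $path,
        last_segment => $segments[-1],   # last bit of the path before any query
        query        => $query,
    };
}

# The "?tab=bronze" query string is a made-up example, not from the question.
my $parts = url_parts('https://stackoverflow.com/badges?tab=bronze');
for my $name (qw(host path last_segment query)) {
    my $value = $parts->{$name};
    printf "%-12s %s\n", $name, defined $value ? $value : '(none)';
}
```

Which of these counts as the title is still a definition you have to pick. The code only shows what each option would return.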

As always, going from a trivial first implementation to covering all corner cases is a daunting task ;-)

Source: https://stackoverflow.com/questions/5532584/getting-the-website-title-from-a-link-in-a-string

Tags

regex

perl

html-parsing