How do I extract an HTML title with Perl?

Posted by 亡梦爱人 on 2019-12-04 05:51:44

Question


Is there a way to extract an HTML page's title using Perl? I know it can be passed as a hidden variable during a form submit and then retrieved in Perl that way, but I was wondering if there is a way to do this without the submit.

For example, let's say I have an HTML page like this:

<html><head><title>TEST</title></head></html>

and then in Perl I want to do:

$q->h1('something');

How can I replace 'something' dynamically with what is contained in the <title> tags?


Answer 1:


I would use pQuery. It works just like jQuery.

You can say:

use feature 'say';   # needed for say()
use pQuery;
my $page = pQuery("http://google.com/");
my $title = $page->find('title');
say "The title is: ", $title->html;

Replacing stuff is similar:

$title->html('New Title');
say "The entirety of google.com with my new title is: ", $page->html;

You can pass an HTML string to the pQuery constructor, which it sounds like you want to do.

Finally, if you want to use arbitrary HTML as a "template", and then "refine" that with Perl commands, you want to use Template::Refine.




Answer 2:


HTML::HeadParser does this for you.
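
As a rough sketch of what that might look like (assuming the HTML is already in a string, here the sample page from the question, and that HTML::HeadParser from the HTML-Parser distribution is installed), the parsed title is read back through the header('Title') accessor. The variable names are only illustrative:

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Sample document from the question; in practice this would be
# the page you fetched or loaded from disk (an assumption here).
my $html = '<html><head><title>TEST</title></head></html>';

my $parser = HTML::HeadParser->new;
$parser->parse($html);

# HTML::HeadParser exposes the <title> contents as the pseudo-header "Title".
my $title = $parser->header('Title');
$title = '' unless defined $title;   # no <title> found

# Stand-in for the question's $q->h1(...) call.
print "<h1>$title</h1>\n";
```

With a CGI object in $q, as in the question, the last line would become print $q->h1($title);.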




Answer 3:


It's not clear to me what you are asking. You seem to be talking about something that could run in the user's browser, or at least something that already has an HTML page loaded.

If that's not the case, the answer is URI::Title.




Answer 4:


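use strict;
use warnings;
use LWP::Simple;

# Take the URL from the command line; die if none was given.
my $url = shift or die "Specify URL on the cmd line";
my $html = get($url);
die "Could not fetch $url" unless defined $html;

# Non-greedy, case-insensitive match; /s lets . span newlines.
if ($html =~ m{<title>(.*?)</title>}is) {
    print "$1\n";
}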



Answer 5:


The previous answer is wrong if the HTML contains more than one title tag. This can easily be overcome by checking that the title is valid (no tags in between).

my ($title) = $test_content =~ m/<title>([a-zA-Z\/][^>]+)<\/title>/si;



Answer 6:


Get the title from the file:

# $absPath, $startstring, $endstring and the output handle WFL are
# assumed to be set up earlier in the script (not shown in this answer).
my $spool = 0;

open my $fh, "<", $absPath or die $!;
# write the opening bracket
print WFL "[";
while (<$fh>) {
    # remove the newline from the line just read
    chomp;
    # remove leading and trailing whitespace
    s/^\s+|\s+$//g;

    # case where <title> and </title> occur on the same line:
    # print the title and stop
    if (/$startstring/i && /$endstring/i) {
        my ($title) = m/$startstring(.+)$endstring/si;
        print WFL "'${title}',";
        last;
    }
    # case where <title> is on one line and </title> is on a later line:
    # the opening <title> is found on this line
    elsif (/$startstring/i) {
        # extract everything after <title> but nothing before it
        my ($title) = m/$startstring(.+)/si;
        print WFL "'${title}";
        $spool = 1;
    }
    # the closing </title> is found
    elsif (/$endstring/i) {
        # take everything before </title> and nothing after it
        my ($title) = m/(.+)$endstring/si;
        print WFL " ${title}',";
        $spool = 0;
        last;
    }
    # collect every line between <title> and </title>
    elsif ($spool == 1) {
        print WFL " $_";
    }
}
close $fh;
# end of getting the title name



Answer 7:


If you just want to extract the page title you can use a regular expression. I believe that would be something like:

my ($title) = $html =~ m/<title>(.+)<\/title>/si;

where your HTML page is stored in the string $html. In the si modifiers, the s stands for single-line mode (i.e., the dot also matches a newline) and the i for ignore case.
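
To illustrate how those modifiers behave, here is a small core-Perl sketch. The extract_title helper and the sample string are made up for this example, and the pattern is a variation on the one above: it uses a non-greedy .*? so the match stops at the first closing tag when a page contains more than one title, and it tolerates attributes on the opening tag. A regex like this is still only a heuristic and not a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of the first <title> element,
# or undef if the string contains none.
sub extract_title {
    my ($html) = @_;
    # /i ignores case, /s lets . match newlines;
    # .*? is non-greedy, so matching stops at the first </title>.
    my ($title) = $html =~ m{<title\b[^>]*>(.*?)</title\s*>}si;
    return unless defined $title;
    $title =~ s/\s+/ /g;     # collapse newlines and runs of whitespace
    $title =~ s/^ | $//g;    # trim the remaining leading/trailing space
    return $title;
}

# Made-up input: upper-case tag, title split over lines, and a second title.
my $html = "<html><head><TITLE>\n  TEST\n  page</TITLE></head>"
         . "<body><title>other</title></body></html>";

my $title = extract_title($html);
print defined $title ? "$title\n" : "no title found\n";
```

With the greedy .+ from the answer, the same input would capture everything up to the last closing title tag, which is why the non-greedy form is used in this sketch.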



Source: https://stackoverflow.com/questions/574199/how-do-i-extract-an-html-title-with-perl
