Using regex to extract URLs from plain text with Perl

Front-end | unresolved | 7 answers | 1977 views
梦如初夏 · Asked 2020-12-16 05:12

How can I use Perl regexes to extract, from plain text, all URLs that belong to a specific domain (with possibly variable subdomains) and have a specific extension? I have tried:

7 Answers
  •  再見小時候
    2020-12-16 06:01

    I have used the following code to extract the links that end with a specific extension
    such as *.htm, *.html, *.gif, or *.jpeg. Note: when the extensions are listed as regex
    alternatives, *.html must be tried before *.htm, because both share the prefix "htm"
    and the engine takes the first alternative that matches; the pattern below sidesteps
    this with html?, but changes of this kind should be made carefully (a short
    demonstration follows).
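
    A quick illustration of the ordering pitfall (hypothetical URL, chosen here only for
    demonstration):

    use strict;
    use warnings;

    my $url = 'http://example.com/page.html';

    # Shorter alternative first: the engine settles for "htm" and stops too early.
    print "$1\n" if $url =~ /\.(htm|html)/;    # prints "htm"

    # Longer alternative first: the full extension is captured.
    print "$1\n" if $url =~ /\.(html|htm)/;    # prints "html"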

    Input: the name of a file containing the links, and the name of an output file.
    Output: the extracted links, saved to the output file.

    Code goes here:

    use strict;
    use warnings;

    # Expect exactly two arguments: the input file and the output file.
    if ( @ARGV != 2 ) {
        die "Incorrect number of arguments.\nArguments: Text_LinkFile, Output_File\n";
    }

    open my $links_fh,  '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $result_fh, '>', $ARGV[1] or die "Cannot open $ARGV[1]: $!";

    my @links;
    while ( my $line = <$links_fh> ) {
        # Non-capturing groups (?:...) make m//g return only the full URL,
        # so no index arithmetic over capture groups is needed.
        push @links, $line =~ m{((?:https?|ftp)://\S+\.(?:html?|gif|jpe?g))}g;
    }
    print {$result_fh} join( "\n", @links ), "\n";

    close $links_fh;
    close $result_fh;
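
    Assuming the script is saved as extract_links.pl (a name chosen here only for
    illustration), it takes the input and output file names on the command line:

    perl extract_links.pl Text_LinkFile Output_File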
    

    The output for your sample text is:

    http://homepage.com/woot.gif
    http://shomepage.com/woot.gif
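
    The pattern above accepts any domain. To restrict matches to one specific domain while
    still allowing variable subdomains, as the question asks, the host part of the regex
    can be anchored to that domain. A minimal sketch, assuming the target domain is
    homepage.com and the extension is .gif (both placeholders):

    use strict;
    use warnings;

    my $text = 'http://homepage.com/woot.gif http://shomepage.com/woot.gif '
             . 'http://img.homepage.com/woot.gif';

    # (?:[\w-]+\.)* permits zero or more subdomain labels before homepage.com;
    # the mandatory scheme prefix keeps shomepage.com from matching as a near-miss.
    my @urls = $text =~ m{((?:https?|ftp)://(?:[\w-]+\.)*homepage\.com/\S+\.gif)}g;
    print "$_\n" for @urls;
    # prints http://homepage.com/woot.gif and http://img.homepage.com/woot.gif,
    # but not http://shomepage.com/woot.gif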
    
