How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:
I have used the following code to extract links that end with a specific extension, such as *.htm, *.html, *.gif, or *.jpeg.
Note: In the extension alternation, *.html must be tried before *.htm because both share the prefix "htm" (the pattern html? handles this by making the trailing "l" optional but greedy). Changes of this kind should be made carefully.
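To see why order matters, here is a minimal sketch (the URL is a made-up example):

use strict;
use warnings;

my $url = 'http://example.com/page.html';

# Alternation tries branches left to right, so "htm" wins and the
# trailing "l" is left unmatched:
my ($ext_short) = $url =~ /\.(htm|html)/;    # "htm"

# Putting the longer alternative first captures the full extension:
my ($ext_full)  = $url =~ /\.(html|htm)/;    # "html"

print "$ext_short vs $ext_full\n";           # prints "htm vs html"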
Input: the name of a file containing links, and the name of an output file.
Output: the matched links, saved in the output file.
Code goes here:
use strict;
use warnings;

# Expect exactly two arguments: the input file and the output file.
if ( @ARGV != 2 ) {
    die "Incorrect number of arguments.\nArguments: Text_LinkFile, Output_File\n";
}

open my $fh_links,  '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
open my $fh_result, '>', $ARGV[1] or die "Cannot open $ARGV[1]: $!";

my @Links;
while ( my $line = <$fh_links> ) {
    # Non-capturing groups (?:...) keep only the full URL in the match list,
    # so every match can be pushed directly without index arithmetic.
    push @Links, $line =~ m{((?:https?|ftp)://\S+\.(?:html?|gif|jpe?g))}g;
}

print $fh_result join( "\n", @Links ), "\n";

close $fh_links;
close $fh_result;
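Run it with the input and output file names, assuming you saved the script as extract_links.pl (a name chosen here for illustration):

perl extract_links.pl Text_LinkFile.txt Output_File.txt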
Output for your sample input:
http://homepage.com/woot.gif
http://shomepage.com/woot.gif
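To restrict the matches to one domain while still allowing variable subdomains, as the question asks, the host part of the pattern can be anchored. A minimal sketch, assuming the target domain is example.com:

# (?:[\w-]+\.)* permits any number of subdomain labels
# (www., img.cdn., etc.) in front of example.com.
my @Matches = ( $_ =~
    m{((?:https?|ftp)://(?:[\w-]+\.)*example\.com/\S*\.(?:html?|gif|jpe?g))}g );

Substitute your own domain for example.com and extend the final alternation with any other extensions you need.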