How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:
I have used the following code to extract links that end with a specific extension, such as *.htm, *.html, *.gif, or *.jpeg.
Note: In the extension alternation, *.html must be tried before *.htm because both share the prefix "htm" (the pattern html? handles this by making the trailing "l" optional but greedy). Changes of this kind should be made carefully.
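To see why order matters, here is a minimal sketch (the URL is a made-up example):

use strict;
use warnings;

my $url = 'http://example.com/page.html';

# Alternation tries branches left to right, so "htm" wins and the
# trailing "l" is left unmatched:
my ($ext_short) = $url =~ /\.(htm|html)/;    # "htm"

# Putting the longer alternative first captures the full extension:
my ($ext_full)  = $url =~ /\.(html|htm)/;    # "html"

print "$ext_short vs $ext_full\n";           # prints "htm vs html"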
Input: the name of a file containing links, and the name of an output file.
Output: the matched links, saved in the output file.
Code goes here:
use strict;
use warnings;

# Expect exactly two arguments: the input file and the output file.
if ( @ARGV != 2 ) {
    die "Incorrect number of arguments.\nArguments: Text_LinkFile, Output_File\n";
}

open my $fh_links,  '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
open my $fh_result, '>', $ARGV[1] or die "Cannot open $ARGV[1]: $!";

my @Links;
while ( my $line = <$fh_links> ) {
    # Non-capturing groups (?:...) keep only the full URL in the match list,
    # so every match can be pushed directly without index arithmetic.
    push @Links, $line =~ m{((?:https?|ftp)://\S+\.(?:html?|gif|jpe?g))}g;
}

print $fh_result join( "\n", @Links ), "\n";

close $fh_links;
close $fh_result;
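Run it with the input and output file names, assuming you saved the script as extract_links.pl (a name chosen here for illustration):

perl extract_links.pl Text_LinkFile.txt Output_File.txt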
Output for your sample input:
http://homepage.com/woot.gif
http://shomepage.com/woot.gif
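To restrict the matches to one domain while still allowing variable subdomains, as the question asks, the host part of the pattern can be anchored. A minimal sketch, assuming the target domain is example.com:

# (?:[\w-]+\.)* permits any number of subdomain labels
# (www., img.cdn., etc.) in front of example.com.
my @Matches = ( $_ =~
    m{((?:https?|ftp)://(?:[\w-]+\.)*example\.com/\S*\.(?:html?|gif|jpe?g))}g );

Substitute your own domain for example.com and extend the final alternation with any other extensions you need.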