Using regex to extract URLs from plain text with Perl

梦如初夏 2020-12-16 05:12

How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:



        
7 answers
  •  攒了一身酷
    2020-12-16 05:50

    Here is a regex that should extract all URLs from a string or text file; it seems to be working for me:

    m,(http.*?://([^\s)\"](?!ttp:))+),g
    

    ... or in an example:

    $ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -e 'while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'
    
    
    a blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah "https://poi.com/a%20b"; (http://bbb.comhttp://roch.com/abc) 
    
    http://www.abc.com/dss.htm?a=1&p=2#chk
    https://poi.com/a%20b
    http://bbb.com
    http://roch.com/abc
    

    For my noob reference, here is the debug version of the same command above:

    $ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -de 'use re "debug" ; while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'
    

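    The regex matches on http(s):// and treats whitespace, " and ) as "exit" characters. It then uses a negative lookahead, (?!...), so that a match already in progress also "exits" when a new "http" literal begins. A lookahead on the full "http:" would be tested right after the last character of the previous URL and would therefore drop that character from the match, so here the lookahead is moved one character forward to "ttp:". Note that this only splits run-together URLs when the second one starts with http:, not https:.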

    Some useful pages:

    • perl: multiple matches on a single line? (edited for proper < > forma
    • regular expression negate a word (not character)
    • Perl Regular Expressions
    • Perl Text Patterns for Search and Replace (intro, $&, @- ... )

    Hope this helps someone,
    Cheers!

    EDIT: Oops, just found out about URI::Find::Simple (search.cpan.org), which seems to do the same thing (via regex - Getting the website title from a link in a string)
