Using regex to extract URLs from plain text with Perl

梦如初夏 2020-12-16 05:12

How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:

7 answers

攒了一身酷 (original poster)

2020-12-16 05:50
Here is a regex that (hopefully) extracts all URLs from a string or text file; it seems to be working for me:
```
m,(http.*?://([^\s)\"](?!ttp:))+),g
```
... or in an example:
```
$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -ne 'while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'


a blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah "https://poi.com/a%20b"; (http://bbb.comhttp://roch.com/abc) 

http://www.abc.com/dss.htm?a=1&p=2#chk
https://poi.com/a%20b
http://bbb.com
http://roch.com/abc
```
For my noob reference, here is the debug version of the same command above:
```
$ echo -e "\n\na blahlah blah:http://www.abc.com/dss.htm?a=1&p=2#chk - blahblah \"https://poi.com/a%20b\"; (http://bbb.comhttp://roch.com/abc) \n" | perl -dne 'use re "debug" ; while ( my $string = <> ) { print "$string\n"; while ( $string =~ m,(http.*?://([^\s)\"](?!ttp:))+),g ) {print "$&\n"} }'
```
The regex matches on http(s):// and treats whitespace, " and ) as "exit" characters. It then uses a negative lookahead, (?!...), to force an "exit" when a new "http:" literal starts while a match is already in progress. Looking ahead for the full "http:" would also "eat" the last character of the previous match, because the character right before "http:" is the one that gets rejected; so here the lookahead is moved one character forward, to "ttp:".
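
To make the lookahead behaviour easier to see, here is a small sketch that runs the regex from above next to two variations on glued-together URLs. The helper `extract_urls` and the variables `$naive_re` and `$https_re` are names made up for this illustration; `$https_re` is only a possible tweak, not part of the original regex.

```perl
use strict;
use warnings;

# Collect every match of $pattern in $text.
# $1 is the outer capture group, i.e. the whole URL.
sub extract_urls {
    my ($text, $pattern) = @_;
    my @urls;
    while ($text =~ /$pattern/g) {
        push @urls, $1;
    }
    return @urls;
}

# The regex from the answer: lookahead shifted forward to "ttp:".
my $answer_re = qr{(http.*?://([^\s)"](?!ttp:))+)};
# For comparison: lookahead on the full "http:" literal.
my $naive_re  = qr{(http.*?://([^\s)"](?!http:))+)};
# Hypothetical variant that also stops before a glued "https:".
my $https_re  = qr{(http.*?://([^\s)"](?!ttps?:))+)};

my $glued = '(http://bbb.comhttp://roch.com/abc)';
print join(' | ', extract_urls($glued, $answer_re)), "\n";
# http://bbb.com | http://roch.com/abc
print join(' | ', extract_urls($glued, $naive_re)), "\n";
# http://bbb.co | http://roch.com/abc   (the "m" is lost)

my $mixed = 'http://a.comhttps://b.com';
print join(' | ', extract_urls($mixed, $answer_re)), "\n";
# http://a.comhttps://b.com             (not split: "ttps:" is not "ttp:")
print join(' | ', extract_urls($mixed, $https_re)), "\n";
# http://a.com | https://b.com
```

So the "ttp:" lookahead only splits before a glued "http:"; if glued "https:" links can occur in the input, something like the `ttps?:` variant may be needed.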

Some useful pages:
- perl: multiple matches on a single line? (edited for proper < > forma
- regular expression negate a word (not character)
- Perl Regular Expressions
- Perl Text Patterns for Search and Replace (intro, $&, @- ... )
Hope this helps someone,
Cheers!

EDIT: Oops, I just found out about URI::Find::Simple (search.cpan.org), which seems to do the same thing (via regex - Getting the website title from a link in a string)
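
Coming back to the original question (only URLs of one domain, with variable subdomains, and one extension): one possible approach is to extract everything with the regex above and then filter the results with a second, stricter pattern. This is just a sketch using core Perl; the domain `example.com`, the extension `pdf`, the sample text and the variable names are all assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical targets, chosen only for this example.
my $domain    = 'example.com';
my $extension = 'pdf';

my $text = <<'END';
See http://www.example.com/docs/a.pdf and (https://cdn.files.example.com/b.pdf?dl=1),
but not http://example.org/c.pdf or http://example.com/d.html
or http://notexample.com/e.pdf
END

# Step 1: extract all URLs with the regex from the answer.
my @urls;
while ($text =~ m,(http.*?://([^\s)\"](?!ttp:))+),g) {
    push @urls, $1;
}

# Step 2: keep only URLs on the wanted domain with the wanted extension.
my $wanted = qr{
    \A https?://
    (?: [a-z0-9-]+ \. )*           # zero or more subdomain labels
    \Q$domain\E
    (?: : \d+ )?                   # optional port
    / [^?\#]* \. \Q$extension\E    # path ending in the extension
    (?: [?\#] .* )? \z             # optional query string or fragment
}xi;

my @matches = grep { $_ =~ $wanted } @urls;
print "$_\n" for @matches;
# http://www.example.com/docs/a.pdf
# https://cdn.files.example.com/b.pdf?dl=1
```

Filtering in a second step keeps the extraction regex unchanged and makes the domain and extension rules easier to read and adjust. The `\Q...\E` quoting stops the dot in the domain from matching any character, and requiring whole `label.` groups before the domain is what rejects look-alike hosts such as notexample.com.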