How do I ignore file types in a web crawler?

Posted by 半腔热情 on 2019-12-01 14:32:22

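Use URI#path, which returns only the path component of the URL, and test the extension against your exclusion list: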

unless URI.parse(url).path =~ /\.(\w+)$/ && $exclude.include?($1)
  puts "downloading #{url}..."
end
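
As a minimal sketch, the same check can be wrapped in a method so the exclusion list is passed in rather than read from a global. The method name `skip_url?`, the `excluded` list, and the sample URLs are assumptions for illustration, not part of the original answer; the extension is also lowercased here, which the snippet above does not do.

```ruby
require 'uri'

# Returns true when the URL's path ends in one of the excluded extensions.
# `skip_url?` and its arguments are illustrative names, not from the answer.
def skip_url?(url, excluded)
  # Capture the word characters after the last dot at the very end of the path.
  ext = URI.parse(url).path.to_s[/\.(\w+)\z/, 1]
  !ext.nil? && excluded.include?(ext.downcase)
end

excluded = %w[flv swf png jpg gif zip pdf] # assumed exclusion list

%w[http://foo.com/bar.jpg http://foo.com/index.html].each do |url|
  puts "downloading #{url}..." unless skip_url?(url, excluded)
end
```

Because only the path is inspected, a query string such as `?file=a.zip` does not affect the result.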
the Tin Man

Ruby lacks a really useful module that Perl has, called Regexp::Assemble. Ruby's Regexp.union comes nowhere near it. Here's how to use Regexp::Assemble in Perl, and its result:

use Regexp::Assemble;

my @extensions = sort qw(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml);

my $ra = Regexp::Assemble->new;
$ra->add(@extensions);

print $ra->re, "\n";

Which outputs:

(?-xism:(?:m(?:p(?:[234]|e?g)|[1o]v|k[av]|3u)|a(?:s[fx]|iff|ac|c3|pe|vi)|p(?:p[st]|df|ng)|r(?:a[rw]|ss)|w(?:m[av]|av)|x(?:ls|ml|sd)|j(?:ar|pg|s)|d(?:oc|td)|g(?:if|z)|f[4l]v|bin|css|exe|ico|ogg|swf|tar|zip|7z))

Perl supports the s flag and Ruby doesn't, so s has to be removed from ?-xism. We also want to ignore character case, so the i has to move in front of the minus sign, resulting in ?i-xm.
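
If the Perl output is pasted in as a string, the flag change could also be applied in Ruby itself. This is only a sketch: `perl_source` is a shortened stand-in for the full pattern, and it assumes the Perl output starts with `(?-xism:` exactly as shown above.

```ruby
# Shortened stand-in for the pattern printed by the Perl script.
perl_source = '(?-xism:(?:j(?:ar|pg|s)|g(?:if|z)|7z))'

# Swap Perl's flag group for one Ruby accepts, turning on case-insensitivity.
ruby_source = perl_source.sub(/\A\(\?-xism:/, '(?i-xm:')
regex = Regexp.new(ruby_source)

'/bar.JPG'[regex] # => "JPG"
```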

Plug that into a Ruby script as the regular expression:

REGEX = /(?i-xm:(?:m(?:p(?:[234]|e?g)|[1o]v|k[av]|3u)|a(?:s[fx]|iff|ac|c3|pe|vi)|p(?:p[st]|df|ng)|r(?:a[rw]|ss)|w(?:m[av]|av)|x(?:ls|ml|sd)|j(?:ar|pg|s)|d(?:oc|td)|g(?:if|z)|f[4l]v|bin|css|exe|ico|ogg|swf|tar|zip|7z))/

@url = URI.parse(url)

puts @url.path[REGEX]

uri = URI.parse('http://foo.com/bar.jpg')
uri.path        # => "/bar.jpg"
uri.path[REGEX] # => "jpg"

See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more about using Regexp::Assemble from Ruby.
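
When Perl isn't available, a less compact but equivalent pattern can be built in plain Ruby with Regexp.union. The sketch below also anchors the pattern to a leading dot and the end of the string, because an unanchored pattern can match in the middle of a path (for example "js" in "/jsmith/index.html"). The helper name `extension_regex` and the shortened extension list are assumptions, not part of the original answer.

```ruby
require 'uri'

# Hypothetical helper: builds a case-insensitive pattern that matches only a
# trailing ".ext" and captures the extension in group 1.
def extension_regex(extensions)
  /\.(#{Regexp.union(extensions).source})\z/i
end

regex = extension_regex(%w[jpg png gif js css zip 7z])

URI.parse('http://foo.com/bar.JPG').path[regex, 1]           # => "JPG"
URI.parse('http://foo.com/jsmith/index.html').path[regex, 1] # => nil
```

The resulting alternation is not factored into shared prefixes the way Regexp::Assemble's output is, so it is easier to read but may be slower for very long lists.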

You can strip off the URL's file extension with a regular expression or split (I've shown the latter here, but beware this will also match URLs that have no file in the path at all, such as http://foo.exe, where "exe" is part of the host name), then use Array#include? to check for membership:

@url = URI.parse(url) unless $exclude.include?(url.split('.').last)
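
One way to avoid the http://foo.exe false positive is to take the extension from the parsed path instead of from the whole URL string. The following is a sketch under that assumption; `url_extension` is a hypothetical helper and `exclude` is an assumed list standing in for `$exclude`.

```ruby
require 'uri'

# Hypothetical helper: extension of the URL's path, lowercased, without the dot.
# Returns "" when the path has no extension (or there is no path at all).
def url_extension(url)
  File.extname(URI.parse(url).path.to_s).delete_prefix('.').downcase
end

exclude = %w[exe zip jpg] # assumed exclusion list

url_extension('http://foo.com/setup.EXE')         # => "exe"
url_extension('http://foo.exe')                   # => ""
exclude.include?(url_extension('http://foo.exe')) # => false
```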