Best way to concurrently check multiple URLs in a database for HTTP status (i.e. 200, 301, 404)


Question


Here's what I'm trying to accomplish. Let's say I have 100,000 URLs stored in a database, and I want to check each of them for HTTP status and store that status. I want to be able to do this concurrently in a fairly small amount of time.

I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers or some sort of evented model, but I don't really have enough experience to know what would work best in this scenario.

Ideas?


Answer 1:


Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.

The "Times" example should get you up and running quickly. In the on_complete block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.

Paul Dix, the original author, talked about his design goals on his blog.

This is some sample code I wrote to download archived mailing lists so I could do local searches. I deliberately removed the URL to keep from subjecting the site to DoS attacks if people start running the code:

#!/usr/bin/env ruby

require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'

BASE_URL = ''

# Fetch the index page and parse it for links.
url = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc = Nokogiri::HTML(resp.body)

# Hydra manages the concurrent requests, ten at a time here.
hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request = Typhoeus::Request.new(gzip_url.to_s)

  request.on_complete do |resp|
    gzip_filename = resp.request.url.split('/').last
    puts "writing #{gzip_filename}"
    # Write in binary mode since these are gzipped files.
    File.open("gz/#{gzip_filename}", 'wb') do |fo|
      fo.write resp.body
    end
  end
  puts "queuing #{ gzip }"
  hydra.queue(request)
end

hydra.run

Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests your throughput will be better. You'll want to play with the concurrency setting, because there is a point where having more concurrent sessions only slows you down and needlessly uses resources.

I give it an 8 out of 10; it's got a great beat and I can dance to it.


EDIT:

When checking the remote URLs you can use a HEAD request, or a GET with an If-Modified-Since header. Both give you responses you can use to determine the freshness of your URLs.
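
For the status-check use case in the question, the same Hydra pattern works with HEAD requests. Below is a minimal sketch, assuming a urls array already loaded from your database and a save_status method you supply yourself; option names can differ slightly between Typhoeus versions:

require 'typhoeus'

# urls is assumed to come from your database query.
# save_status is a placeholder for whatever persistence call you use.
hydra = Typhoeus::Hydra.new(:max_concurrency => 10)

urls.each do |url|
  request = Typhoeus::Request.new(url, :method => :head)

  request.on_complete do |response|
    # response.code is the HTTP status (200, 301, 404, ...).
    save_status(url, response.code)
  end

  hydra.queue(request)
end

hydra.run

Redirects aren't followed unless you ask for that, so a 301 should come back as a 301 rather than as the status of the final destination.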




Answer 2:


I haven't done anything multithreaded in Ruby, only in Java, but it seems pretty straightforward: http://www.tutorialspoint.com/ruby/ruby_multithreading.htm

From what you described, you don't need any queue and workers (well, I'm sure you can do it that way too, but I doubt you'll get much benefit). Just partition your URLs among several threads, let each thread process its chunk, and update the database with the results. E.g., create 100 threads and give each thread a range of 1,000 database rows to process.

You could even just create 100 separate processes and give them rows as arguments, if you'd rather deal with processes than threads.

To get the URL status, I think you do an HTTP HEAD request, which I guess is http://apidock.com/ruby/Net/HTTP/request_head in Ruby.
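
Putting those pieces together (a fixed number of threads, each handling a slice of rows, and a HEAD request per URL via Net::HTTP), a rough sketch might look like the following. The urls array, the thread count, and the slicing are assumptions you'd adapt to your own schema and hardware:

require 'net/http'
require 'uri'

THREAD_COUNT = 100 # assumption: tune to what your machine and network can handle

# urls is assumed to be a non-empty array of URL strings pulled from the database.
slice_size = (urls.size / THREAD_COUNT.to_f).ceil

threads = urls.each_slice(slice_size).map do |slice|
  Thread.new do
    slice.each do |url|
      uri = URI.parse(url)
      response = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
        http.head(uri.request_uri)
      end
      # response.code is a string like "200", "301", or "404";
      # update the corresponding database row here.
      puts "#{url} => #{response.code}"
    end
  end
end

threads.each(&:join)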




Answer 3:


The work_queue gem is the easiest way to perform tasks asynchronously and concurrently in your application.

require 'work_queue'
require 'net/http'
require 'uri'

# Ten worker threads pulling tasks off the queue.
wq = WorkQueue.new 10

urls.each do |url|
  wq.enqueue_b do
    # get_response needs a URI object; url here is assumed to be a string.
    response = Net::HTTP.get_response(URI.parse(url))
    puts response.code
  end
end

wq.join
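
If you only need the status code, swapping Net::HTTP.get_response for a HEAD request (as mentioned in the first answer's edit) avoids downloading each response body, which adds up quickly across 100,000 URLs.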


Source: https://stackoverflow.com/questions/4832956/best-way-to-concurrently-check-urls-for-status-i-e-200-301-404-for-multiple-u
