Question
Is there a web crawler library available for PHP or Ruby? I'm looking for a library that can crawl depth-first or breadth-first and that resolves links correctly even when they are relative, e.g. href="../relative_path.html", or when a base URL is used.
Answer 1:
For a Ruby library, check out Ruby Mechanize.
Note that you would still be responsible for how your crawler traverses a site, including the order in which it visits pages.
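To illustrate what "being responsible for the traversal" can look like, here is a minimal sketch using only Ruby's standard library (`uri` and `set`), not Mechanize itself. The names `resolve_link`, `crawl`, and the `fetch_links` callable are hypothetical and made up for this example; `fetch_links` stands in for whatever fetches a page and returns its raw href values. Relative links such as `../relative_path.html` are resolved with `URI.join`, and an optional `base_href` argument covers pages that declare a base URL.

```ruby
require "uri"
require "set"

# Resolve a raw href against the page URL, or against the page's
# <base href> when one is given. Returns nil for unparsable links.
def resolve_link(page_url, href, base_href = nil)
  base = base_href ? URI.join(page_url, base_href).to_s : page_url
  uri = URI.join(base, href)
  uri.fragment = nil # "#section" anchors point at the same page
  uri.to_s
rescue URI::Error
  nil
end

# Visit pages reachable from start_url and return them in visit order.
# fetch_links is a hypothetical callable: URL string in, array of raw hrefs out.
# order: :breadth_first treats the frontier as a queue,
#        :depth_first treats it as a stack.
def crawl(start_url, fetch_links, order: :breadth_first, limit: 100)
  frontier = [start_url]
  seen = Set.new([start_url])
  visited = []

  until frontier.empty? || visited.size >= limit
    url = order == :breadth_first ? frontier.shift : frontier.pop
    visited << url

    fetch_links.call(url).each do |href|
      link = resolve_link(url, href)
      next if link.nil? || seen.include?(link)

      seen << link
      frontier << link
    end
  end

  visited
end
```

In a real crawler, `fetch_links` would download the page and extract the `href` attributes (for example with Mechanize), and you would likely also want to restrict the crawl to one host and respect robots.txt. The only difference between the two traversal orders here is which end of the frontier the next URL is taken from.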
Answer 2:
http://phpcrawl.cuab.de/
Answer 3:
You can use Webrat or Watir in Ruby; I find both much easier than Mechanize.
Answer 4:
If you'd like to learn the basics of web crawling and search, you can start by looking at "luna engine".
Answer 5:
If you need to scrape web pages that rely on JavaScript, you can use Capybara with a driver that spins up a real browser, such as Poltergeist. It's usually used with a testing framework for acceptance testing, but it can also be used outside of one.
Source: https://stackoverflow.com/questions/855873/is-there-a-web-crawler-library-available-for-php-or-ruby