Use Ruby Mechanize to scrape all successive pages


Question


I'm looking for assistance on the best way to loop through successive pages on a website while scraping relevant data off of each page.

For example, I want to go to a specific site (craigslist in the example below), scrape the data from the first page, go to the next page, scrape all relevant data, and so on until the very last page.

In my script I'm using a while loop since it seemed to make the most sense to me. However, it doesn't appear to be working properly and is only scraping data from the first page.

Can someone familiar with Ruby/Mechanize point me in the right direction on the best way to accomplish this task? I've spent countless hours trying to figure this out and feel like I'm missing something very basic.

Thanks in advance for your help.

require 'mechanize'
require 'pry'

# initialize
agent = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari' }
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# Create an empty array to dump contents into
property_results = []

# Scrape all successive pages from craigslist
while page.link_with(:dom_class => "button next") != nil
  next_link = page.link_with(:dom_class => "button next")
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  page = next_link.click
end

UPDATE: I found this, but still no dice:

Ruby Mechanize: Follow a Link (@pguardiario)

require 'mechanize'
require 'httparty'
require 'pry'

# initialize
agent = Mechanize.new
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# Create an empty array
property_results = []

# Scrape all successive pages from craigslist
while link = page.at('[rel=next]')
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  link = page.at('[rel=next]')
  page = agent.get link[:href]
end
pry(binding)

Answer 1:


Whenever you see a [rel=next], that's the thing you want to follow:

page = agent.get url
do_something_with page
while link = page.at('[rel=next]')
  page = agent.get link[:href]
  do_something_with page
end
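
Putting the two pieces together, here is a minimal sketch that plugs the question's title scraping into the answer's [rel=next] loop. The selectors (ul.rows, a.result-title.hdrlnk) are taken from the question's code and assume craigslist's markup at the time; treat them as placeholders for whatever data you actually want to pull from each page.

require 'mechanize'

agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' }
property_results = []

# Extract the listing titles from one results page.
scrape_page = lambda do |page|
  page.search('ul.rows a.result-title.hdrlnk').each do |link|
    property_results << { title: link.text.strip }
  end
end

# Scrape the first page, then keep following [rel=next] until it disappears.
page = agent.get('http://charlotte.craigslist.org/search/rea')
scrape_page.call(page)

while (next_link = page.at('[rel=next]'))
  page = agent.get(next_link[:href])
  scrape_page.call(page)
end

puts "Scraped #{property_results.size} listings"

The key structural point in the answer's pattern is that the page just fetched is processed inside the loop, so the final page (which has no next link) still gets scraped before the loop exits.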


Source: https://stackoverflow.com/questions/40880921/use-ruby-mechanize-to-scrape-all-successive-pages
