Use Ruby Mechanize to scrape all successive pages


Question


I'm looking for assistance on the best way to loop through successive pages on a website while scraping relevant data off of each page.

For example, I want to go to a specific site (craigslist in the example below), scrape the data from the first page, go to the next page, scrape all relevant data, and so on until the very last page.

In my script I'm using a while loop since it seemed to make the most sense to me. However, it doesn't appear to be working properly and is only scraping data from the first page.

Can someone familiar with Ruby/Mechanize point me in the right direction on the best way to accomplish this task? I've spent countless hours trying to figure this out and feel like I'm missing something very basic.

Thanks in advance for your help.

require 'mechanize'
require 'pry'

# initialize
agent = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari' }
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# Create an empty array to dump contents into
property_results = []

# Scrape all successive pages from craigslist
while page.link_with(:dom_class => "button next") != nil
  next_link = page.link_with(:dom_class => "button next")
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  page = next_link.click
end

UPDATE: I found this, but still no dice:

Ruby Mechanize: Follow a Link (@pguardiario)

require 'mechanize'
require 'httparty'
require 'pry'

# initialize
agent = Mechanize.new
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# Create an empty array
property_results = []

# Scrape all successive pages from craigslist
while link = page.at('[rel=next]')
  page.css('ul.rows').map do |d|
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }
    property_results.push(property_hash)
  end
  link = page.at('[rel=next]')
  page = agent.get link[:href]
end
pry(binding)

Answer 1:


Whenever you see a [rel=next], that's the thing you want to follow:

page = agent.get url
do_something_with page
while link = page.at('[rel=next]')
  page = agent.get link[:href]
  do_something_with page
end
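
Putting the two pieces together, here is a minimal sketch that plugs the question's title scraping into the answer's [rel=next] loop. The selectors (ul.rows, a.result-title.hdrlnk) are taken from the question's code and assume craigslist's markup at the time; treat them as placeholders for whatever data you actually want to pull from each page.

require 'mechanize'

agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' }
property_results = []

# Extract the listing titles from one results page.
scrape_page = lambda do |page|
  page.search('ul.rows a.result-title.hdrlnk').each do |link|
    property_results << { title: link.text.strip }
  end
end

# Scrape the first page, then keep following [rel=next] until it disappears.
page = agent.get('http://charlotte.craigslist.org/search/rea')
scrape_page.call(page)

while (next_link = page.at('[rel=next]'))
  page = agent.get(next_link[:href])
  scrape_page.call(page)
end

puts "Scraped #{property_results.size} listings"

The key structural point in the answer's pattern is that the page just fetched is processed inside the loop, so the final page (which has no next link) still gets scraped before the loop exits.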


Source: https://stackoverflow.com/questions/40880921/use-ruby-mechanize-to-scrape-all-successive-pages
