Data scraping multiple page clicks loops

Question

We are trying to find a way to use a single Mechanize instance to scrape all of the data we want from the UCAS website and add it to arrays. Currently we are struggling to code the link clicks for Mechanize. There are three successive link clicks, nested inside loops, that are needed to progress through all of the search result pages. Can anyone help?

The first link, which displays all courses for a university, is within the div with class morecourseslink.

The second link, which displays course names, duration and qualification, is in the div with class coursenamearea.

The third link is in div coursedetailsshowable, and the id of the a element is coursedetailtab_entryreqs.

Currently we are scraping the university names with the code below:

require 'mechanize'

class PagesController < ApplicationController
  def home
    mechanize = Mechanize.new

    @uninames_array = []

    page = mechanize.get('http://search.ucas.com/search/providers?CountryCode=3&RegionCode=&Lat=&Lng=&Feather=&Vac=2&Query=&ProviderQuery=&AcpId=&Location=scotland&IsFeatherProcessed=True&SubjectCode=&AvailableIn=2016')

    # Collect the university names from the first results page.
    page.search('li.result h3').each do |h3|
      @uninames_array.push(h3.text)
    end

    # Follow the ">" pager link until there is no further page.
    while (next_page_link = page.at('.pager a[text()=">"]'))
      page = mechanize.get(next_page_link['href'])

      page.search('li.result h3').each do |h3|
        @uninames_array.push(h3.text)
      end
    end

    puts @uninames_array.to_s
  end
end

The course names, durations and qualifications are scraped with the code below:

require 'mechanize'

mechanize = Mechanize.new
@duration_array = []
@qual_array = []
@courses_array = []

page = mechanize.get('http://search.ucas.com/search/results?Vac=2&AvailableIn=2016&IsFeatherProcessed=True&page=1&providerids=41')

# Read all three fields from each results page in one pass. Running three
# separate pager loops would leave the later loops on the last page already.
loop do
  page.search('div.courseinfoduration').each do |x|
    @duration_array.push(x.text.strip)
    puts x.text.strip
  end

  page.search('div.courseinfooutcome').each do |y|
    @qual_array.push(y.text.strip)
    puts y.text.strip
  end

  page.search('div.coursenamearea h4').each do |h4|
    @courses_array.push(h4.text.strip)
    puts h4.text.strip
  end

  # Follow the ">" pager link until there is no further page.
  next_page_link = page.at('.pager a[text()=">"]')
  break unless next_page_link

  page = mechanize.get(next_page_link['href'])
end

Answer 1:

If you want to do this with one Mechanize instance, why not string the steps together and store the pages you need to jump to and from in variables?
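As a rough sketch of what keeping the pages in variables could look like, the code below follows the three links from the question in nested loops with a single agent. The selectors `div.morecourseslink a`, `div.coursenamearea a` and `a#coursedetailtab_entryreqs` are inferred from the class and id names given in the question and have not been checked against the live site. The names `absolute_url` and `scrape_entry_requirements` and the keys of the returned hashes are made up for illustration, and paging through the result lists is left out to keep it short.

```ruby
require 'uri'

# Resolve a possibly relative href against the URI of the page it was found on.
def absolute_url(page, href)
  URI.join(page.uri.to_s, href).to_s
end

# Follows provider list -> course list -> course -> entry requirements.
# `agent` is assumed to behave like a Mechanize instance: it responds to #get
# and returns pages that respond to #search, #at and #uri.
# With the mechanize gem: scrape_entry_requirements(Mechanize.new, providers_url)
def scrape_entry_requirements(agent, providers_url)
  results = []
  providers_page = agent.get(providers_url)

  # First click: the "more courses" link of every provider.
  providers_page.search('div.morecourseslink a').each do |provider_link|
    courses_page = agent.get(absolute_url(providers_page, provider_link['href']))

    # Second click: every course link on that provider's course list.
    courses_page.search('div.coursenamearea a').each do |course_link|
      course_page = agent.get(absolute_url(courses_page, course_link['href']))

      # Third click: the entry requirements tab, skipped if the page has none.
      tab_link = course_page.at('div.coursedetailsshowable a#coursedetailtab_entryreqs')
      next unless tab_link

      reqs_page = agent.get(absolute_url(course_page, tab_link['href']))
      results << { course: course_link.text.strip, requirements_url: reqs_page.uri.to_s }
    end
  end

  results
end
```

Because each level keeps its own page variable, the outer loops can carry on from `providers_page` and `courses_page` after the agent has fetched deeper pages. Resolving every href against the page it came from avoids depending on whichever page the agent visited last.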

If all of your code works, you can simply combine it into one method:

require 'mechanize'

def home
  mechanize = Mechanize.new

  @uninames_array = []
  @duration_array = []
  @qual_array = []
  @courses_array = []

  # Step 1: university names from every provider results page.
  page = mechanize.get('http://search.ucas.com/search/providers?CountryCode=3&RegionCode=&Lat=&Lng=&Feather=&Vac=2&Query=&ProviderQuery=&AcpId=&Location=scotland&IsFeatherProcessed=True&SubjectCode=&AvailableIn=2016')

  loop do
    page.search('li.result h3').each do |h3|
      @uninames_array.push(h3.text)
    end

    # Follow the ">" pager link until there is no further page.
    next_page_link = page.at('.pager a[text()=">"]')
    break unless next_page_link

    page = mechanize.get(next_page_link['href'])
  end

  # Step 2: reuse the same agent for durations, qualifications and course names.
  page = mechanize.get('http://search.ucas.com/search/results?Vac=2&AvailableIn=2016&IsFeatherProcessed=True&page=1&providerids=41')

  loop do
    page.search('div.courseinfoduration').each do |x|
      @duration_array.push(x.text.strip)
      puts x.text.strip
    end

    page.search('div.courseinfooutcome').each do |y|
      @qual_array.push(y.text.strip)
      puts y.text.strip
    end

    page.search('div.coursenamearea h4').each do |h4|
      @courses_array.push(h4.text.strip)
      puts h4.text.strip
    end

    next_page_link = page.at('.pager a[text()=">"]')
    break unless next_page_link

    page = mechanize.get(next_page_link['href'])
  end
end
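
The paging logic in the method above is repeated for every kind of result, so one possible way to shorten it is to move the paging into a helper that yields each results page in turn. This is only a sketch: `each_result_page` and `collect_texts` are hypothetical names, and the `.pager a[text()=">"]` selector is simply reused from the question.

```ruby
require 'uri'

# Yields the page at `url`, then every page reached through the ">" pager link.
# `agent` is assumed to behave like a Mechanize instance (responds to #get).
def each_result_page(agent, url)
  page = agent.get(url)
  loop do
    yield page

    next_link = page.at('.pager a[text()=">"]')
    break unless next_link

    # Resolve the href against the current page in case it is relative.
    page = agent.get(URI.join(page.uri.to_s, next_link['href']).to_s)
  end
end

# Collects the stripped text of every node matching each selector, over all pages.
# `selectors` maps a label of your choice to a CSS selector.
def collect_texts(agent, url, selectors)
  found = {}
  selectors.each_key { |label| found[label] = [] }

  each_result_page(agent, url) do |page|
    selectors.each do |label, css|
      page.search(css).each { |node| found[label] << node.text.strip }
    end
  end

  found
end
```

With the mechanize gem loaded, this might be called as `collect_texts(Mechanize.new, results_url, { durations: 'div.courseinfoduration', quals: 'div.courseinfooutcome', courses: 'div.coursenamearea h4' })`, where `results_url` stands for the results URL from the question. Reading all selectors from the same page before moving on means every array is filled from every page.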

Source: https://stackoverflow.com/questions/37681359/data-scraping-multiple-page-clicks-loops

Tags

ruby-on-rails

ruby

web-scraping

nokogiri

mechanize